Abstract. This paper develops new properties of the core function on a residue number system (RNS)
which allows efficient implementation of the operations of comparison, overflow detection, sign
determination, parity determination, scaling, and general division. Previously these operations have
been considered difficult to implement in high speed hardware and have not taken advantage of
the parallel structure of a residue class architecture. In 1977, Akushskii, Burcev and Pak introduced
the core function and presented algorithms for these difficult operations. While these algorithms
were superior to previous techniques, the evaluation of the core function required, in general, an
iterative, complex procedure. Moreover, while these techniques were theoretically attractive, they
were unable to construct a methodology for determining a suitable core function for a realistic
moduli set. In the present work, a new method for evaluating the core function is introduced. This
method utilizes a redundant modulus and its computational complexity is equivalent to that of the
first iteration of the method of Akushskii et al. While this new method requires additional hardware
to carry on this redundant modulus calculation, it provides more information and allows more
flexibility than the previous method. New and much more efficient algorithms for the difficult RNS
operations are developed. In addition, new structural properties of the core function are developed,
and the optimality of the function is characterized. The selection of an optimal core function for any
moduli set is cast as an integer programming problem. Finally, to ensure its applicability to real time
signal processing hardware, the feasibility of implementing the core and other residue functions in
VLSI circuits was investigated. It was determined that a core calculation can be implemented on a
single chip in one 50 nsec clock cycle.
I. INTRODUCTION


The  Residue  Number  System  (RNS)  offers  the  potential  for  very  high  speed  integer  arithmetic  for 
signal  processing  applications.  However,  due  to  the  relative  difficulty  of  performing  such  operations 
as  sign  detection,  magnitude  comparison,  and  parity  checking  for  residue  encoded  numbers,  it  has 
been impractical to embed data dependent logical branching in the overall computational flow. Since
many  signal  processing  problems  require  the  reduction  of  large  volumes  of  data  to  a  few,  relatively 
simple  decision  paths,  the  RNS  has  had  an  insignificant  impact  on  commercial  or  military  signal 
processing  systems.  The  primary  goal  of  this  project  was  to  investigate  new  RNS  methods  to  satisfy 
the  nonlinear  signal  processing  requirements  of  the  Navy's  Ocean  Surveillance  Signal  Processing 
Program.  A  secondary  goal  of  this  effort  was  to  investigate  and  evaluate  recent  Soviet 
developments  in  the  use  of  the  RNS  for  high  speed  computation.  Initial  investigations  of  the  Soviet 
work  yielded  some  promise  of  a  solution  to  the  nonlinear  signal  processing  problem.  In  this  paper 
the  Soviet  work  is  described  and  new  developments  presented  which  include  new  algorithms 
permitting  hardware  implementation  of  the  nonlinear  operations.  This  solution  is  superior  to  any 
previous  RNS  techniques  and  is  competitive  with  a  binary  implementation  in  terms  of  computational 
latency,  throughput,  and  hardware  complexity. 

A  considerable  body  of  Soviet  work  exists  in  the  open  literature,  dating  back  to  the  mid  1960's.  This 
includes  well  over  fifty  papers  and  two  recent  books,  many  of  which  are  referenced  in  Miller  et  al  [3]. 
A  number  of  patents  have  been  issued  for  the  device  implementation  of  many  of  these  concepts,  and 
special  purpose  RNS  processors  have  been  built.  The  principal  investigators  include  I.  Akushskii,  V. 
Amerbaev,  V.  Burcev,  I.  Pak,  and  D.  Yuditskii.  Their  results  have  both  paralleled  and  diverged  from 
Western  work. 

In the late 1960's, hardware architectures for complex residue systems were developed, utilizing an
isomorphism from the complex RNS on the moduli set $(p_1 + iq_1, \ldots, p_n + iq_n)$ to the real RNS on the
moduli set $(p_1^2 + q_1^2, \ldots, p_n^2 + q_n^2)$. Much of this early work concentrated on the use of redundant
residue systems for error detection and correction. In the 1970's, subjects considered included the
use  of  non-coprime  moduli,  quadratic  residue  systems,  magnitude  comparison,  parity  detection,  RNS 
representations  of  rational  numbers,  fractional  multiplication  and  division,  rounding,  and  the  use  of 
optical  techniques  for  RNS  processing. 

A  recurrent  theme  in  much  of  the  work  from  the  mid  1960's  to  the  present  has  been  the  study  of 
positional  characteristics  of  residue  encoded  numbers.  A  positional  characteristic  of  a  number  is  a 
characteristic  easily  determined  from  a  positional  radix  representation.  These  efforts  were  directed 
at  defining  a  function  from  the  RNS  to  the  integers  which  could  be  easily  evaluated  and  would 
permit  efficient  sign  detection,  comparison,  and  parity  determination.  In  1970  Amerbaev  studied 
the  use  of  threshold  networks  for  determining  positional  characteristics,  leading  to  a  1973  patent 
with  Akushskii  and  others  for  its  hardware  implementation.  During  the  1970's,  several  techniques 
were  proposed  which  offered  detection  of  overflow  under  addition  and  other  positional 
information  when  using  specialized  moduli  sets. 

In 1977, Akushskii, Burcev, and Pak introduced the notion of the core of a residue number [1], [2].
The  core  function  provides  an  easily  implemented  and  efficient  technique  for  performing  the 
traditionally  difficult  residue  operations:  sign  detection,  magnitude  comparison,  scaling,  parity 
determination,  overflow  detection,  and  extension  of  base. 

This  paper  reviews  the  concept  and  properties  of  the  core,  describes  algorithms  for  performing  the 
difficult  residue  operations,  presents  new  techniques  for  the  efficient  evaluation  of  the  core,  and 
characterizes  the  problem  of  selecting  an  appropriate  core  function  for  a  given  moduli  set.  For 
example,  comparison  of  two  numbers  in  a  residue  system  with  n  moduli  can  be  performed  in  3 
clock  periods  using  n  +  5  integrated  circuits  and  can  be  pipelined  to  achieve  a  20  MHz  data  rate. 


Similar improvement over existing methods can be obtained for the other difficult residue
operations.

Section  II  defines  the  core  function  and  presents  a  number  of  its  properties.  Section  III  discusses  the 
computation  of  the  core.  For  a  properly  selected  core  function,  the  evaluation  of  the  core  at  most 
integers  within  the  RNS  range  is  easily  computed  from  its  residue  representation.  However,  in  some 
cases  this  calculation  yields  an  ambiguous  result,  and  a  more  complicated  algorithm  is  required  to 
evaluate  the  core.  In  Section  IV  a  new  algorithm  for  core  calculation  is  presented  which  utilizes  a 
redundant  modulus  to  avoid  these  difficulties.  Another  difficulty  for  the  practical  utilization  of  the 
core  has  been  the  selection  of  certain  fixed  coefficients  which  determine  a  particular  core  function. 
Section  V  shows  how  this  can  be  recast  as  an  integer  programming  problem  which  can  be  solved  by 
standard  tools  to  obtain  an  optimal  core. 

Section  VI  presents  algorithms  for  magnitude  comparison,  and  Section  VII  gives  algorithms  for 
general  division.  Two  algorithms  are  presented  in  each  section:  a  general  algorithm  for  an  arbitrary 
core  function  and  an  efficient  algorithm  for  a  suitably  linear  core.  In  Section  VIII,  the  VLSI 
implementation  of  the  core  function  and  other  RNS  operations  is  discussed.  The  goal  of  this 
investigation  was  to  ascertain  the  feasibility  of  implementing  several  basic  residue  operations  on  a 
single  chip.  Given  the  current  status  of  VLSI  design  tools,  the  most  efficient  approach  to  this 
feasibility  investigation  is  to  perform  complete  design  and  layout  of  prototype  VLSI  circuits.  This 
section  describes  coding  considerations,  logic  reductions,  and  the  VLSI  design  steps.  The  results  of 
this  work  affirm  the  feasibility  of  implementing  such  algorithms  as  a  core  evaluation  on  a  single  chip 
at high speed and throughput.

The  work  of  Akushskii  et  al.  has  continued,  focusing  on  the  use  of  the  core  function  to  perform 
floating  point  arithmetic  using  the  RNS.  In  this  setting,  an  integer  a  in  the  residue  system  actually 


represents  the  product  of  the  proper  fraction  a/M  with  a  power  of  2,  where  M  is  the  product  of  the 
moduli.  Algorithms  for  addition,  multiplication,  division,  and  binary  shifting  have  been  developed, 
and  a  patent  for  a  general  purpose  floating  point  arithmetic  unit  based  on  these  concepts  has 
appeared.  These  algorithms  are  described  in  the  Appendix. 


II. DEFINITION AND PROPERTIES OF THE CORE FUNCTION


Let $m_1, \ldots, m_n$ be the relatively prime moduli of a residue number system with product $M$. For any
integer $a$, $0 \le a < M$, there exists a unique n-tuple $(\alpha_i) = (\alpha_1, \ldots, \alpha_n)$ with $0 \le \alpha_i < m_i$ for which
$a \equiv \alpha_i \pmod{m_i}$. In this case we write $a = (\alpha_i)$. If $b$ is an arbitrary integer satisfying
$b \equiv \alpha_i \pmod{m_i}$, we say $b$ is a representative of $(\alpha_i)$ and write $b \equiv (\alpha_i) \pmod{M}$. A residue
encoding $(\alpha_i)$ will always be assumed to lie in the interval $[0, M)$. For any integers $a$ and $m$, $|a|_m$
denotes the least non-negative residue of $a$ modulo $m$, and if $(a, m) = 1$, $|1/a|_m$ denotes the
multiplicative inverse of $a$ modulo $m$.

The n-tuple $(\alpha_1, \ldots, \alpha_n)$ determines a unique integer $a$, $0 \le a < M$. Nevertheless, unless decoded, the
n-tuple conveys no positional information, i.e., magnitude comparisons are not possible from simple
direct analysis of the n-tuples. It is of interest to find some means of attaining such information in an
economic and computationally feasible way.

For a given modulus, say $m_i$, an integer $a$ is uniquely determined by the quotient $[a/m_i]$ and the
residue $|a|_{m_i} = \alpha_i$. Furthermore the information given by the pairs (quotient, remainder) is
sufficient for magnitude comparisons. If $a$ and $b$ are any two integers, then $a < b$ if and only if
$[a/m_i] < [b/m_i]$, or $[a/m_i] = [b/m_i]$ and $|a|_{m_i} < |b|_{m_i}$.

Thus if both $[a/m_i]$ and $(\alpha_i)$ are represented in the RNS, comparisons are possible. However, this
approach is not practical since it leads to calculations on domains of just one order of magnitude
smaller than that of the RNS, viz., for $a \in [0, M)$ the quotients $[a/m_i]$ take values in $[0, M/m_i]$. An
alternative is to consider a simple function of the set of quotients $[a/m_i]$. The simplest form for such
a function is an integer linear combination of the quotients. This is the form of the core function as
defined by Akushskii et al.


Definition. Let $w_1, \ldots, w_n$ be fixed but arbitrary integers, not all zero. The core $C(a)$ of an arbitrary
integer $a$ is defined as

$$C(a) = \sum_{i=1}^{n} w_i \left[ \frac{a}{m_i} \right], \qquad (1)$$

where $[\cdot]$ denotes the greatest integer function.
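As an illustration only (not part of the original text), definition (1) can be transcribed directly. The sketch below uses the moduli set {7, 9, 11} and core coefficients {-1, -1, 3} quoted for Figure 1(a); the helper name `core` is an assumption made here for convenience.

```python
from math import prod

def core(a, moduli, weights):
    """Core C(a) = sum_i w_i * floor(a / m_i), transcribed from definition (1)."""
    # Python's // is floor division, i.e. the greatest integer function.
    return sum(w * (a // m) for m, w in zip(moduli, weights))

moduli = (7, 9, 11)      # moduli set quoted for Figure 1(a)
weights = (-1, -1, 3)    # core coefficients quoted for Figure 1(a)
M = prod(moduli)         # 693

core_of_M = core(M, moduli, weights)                    # C(M)
values = [core(a, moduli, weights) for a in range(M)]   # C(a) for every a in [0, M)
c_min, c_max = min(values), max(values)                 # range of the core on [0, M)
```

For this moduli set and these coefficients, $C(M) = -99 - 77 + 189 = 13$, which is on the order of the individual moduli.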

The core coefficients $w_i$ are fixed for the moduli set and do not depend on the integer $a$. The
selection of the $w_i$ is important for efficiency of the core calculation, cf. Sections III and IV. It is
desirable that the $w_i$ are chosen so that the range of the core function on $[0, M)$ is small, i.e., on the
order of the individual moduli of the system. It is not obvious from the definition that $C(a)$ is easily
computable from $(\alpha_i)$, since the definition would require that $(\alpha_i)$ first be decoded to obtain $a$, then $n$
integer divisions be performed, and finally an inner product of length $n$ taken. It will be shown later
that the core can be obtained directly from the $(\alpha_i)$ as the residue of an inner product.

Note also from the definition that $C(a)$ is a step function. As $a$ increases, $C(a)$ remains constant until $a$
equals a multiple of any of the moduli, at which a step jump occurs. Note next that if all the $w_i$ are
positive, $C(a)$ is a non-decreasing function and hence contains direct positional information;
however, $C(M)$ (and hence the range of the core function) will be quite large, and evaluation of the
core function becomes computationally intractable. For example, if each $w_i = 1$,
$C(M) = \sum_i w_i M/m_i = \sum_i \hat{m}_i$, where $\hat{m}_i = M/m_i$. Thus it will be required that both positive and
negative core coefficients are used, resulting in a core function which is no longer monotonic.

Akushskii et al. considered the core a "positional characteristic" of a residue encoded number. For
computational reasons, they required selection of the $w_i$'s so that $C(0)$ $(= 0)$ and $C(M)$ are near the
minimum and maximum cores on $[0, M)$, respectively. Consequently, the cores they considered
tended to be nearly non-decreasing with step function graphs having some negative steps but
generally following the straight line from $(0, 0)$ to $(M, C(M))$. Such cores will thus contain global
positional information: $a \ll b$ if and only if $C(a) \ll C(b)$. Locally, however, such a core does not
directly distinguish magnitudes. Thus in selecting the coefficients $w_i$ for a core function, the tradeoff
between compactness of range and monotonicity must be balanced.

We  will  demonstrate  a  modified  algorithm  for  computing  cores  which  does  not  require  the 
approximate monotonicity of the core function. Thus highly nonlinear, "wild" core functions which
may  have  very  small  ranges  can  be  considered.  In  Section  IV  it  is  shown  that  such  wild  core  functions 
can  be  used  for  parity  determination,  extension  of  base,  and  scaling.  However,  for  sign  detection 
and  magnitude  comparison,  a  global  linearity  of  the  core  is  still  required. 

Figure 1 illustrates the graphs of four cores. The first graph is taken from an example in [2] having
moduli set {7, 9, 11} and core coefficients {-1, -1, 3}. The core is approximately monotonic, taking its
minimum near 0 and its maximum near M = 693. Figure 1(b) illustrates a core for the same moduli
set and core coefficients {1, 2, -4}. For this moduli set, these coefficients give rise to the core of
minimal range. In Figures 1(c), (d), cores are shown for the more practical moduli set
{23, 25, 27, 29, 31}. The core coefficients for (c) are {-3, -2, 4, -1, 3} and for (d) are {-1, 8, -2, -4, -2}.

In  this  section,  several  properties  of  the  core  are  derived  and  used  to  give  efficient  algorithms  for  the 
"difficult"  residue  operations.  Since  these  algorithms  require  calculation  of  the  core,  the  elegance 
of  the  algorithms  cannot  be  fully  appreciated  until  core  calculation  is  discussed  in  Sections  III  and  IV. 
Most of the theorems in this section are due to Akushskii et al., although the proofs have been
considerably  simplified  and  several  errors  corrected. 


Figure 1. Graphs of four core functions illustrate the trade between compactness of range and linearity.
(a) moduli set = {7, 9, 11}, core weights = {-1, -1, 3};
(b) moduli set = {7, 9, 11}, core weights = {1, 2, -4};
(c) moduli set = {23, 25, 27, 29, 31}, core weights = {-3, -2, 4, -1, 3};
(d) moduli set = {23, 25, 27, 29, 31}, core weights = {-1, 8, -2, -4, -2}.


The first theorem gives a formula for an integer $a$ in terms of its core and its residue representation,
and thus provides a decoding algorithm which is similar to the Chinese remainder theorem with the
advantage that no modulo $M$ reductions are required.

Theorem 1 (Akushskii et al.). If $a$ is any integer (not necessarily restricted to $[0, M)$), and $C(M) \ne 0$,
then

$$a = \frac{M \cdot C(a) + \sum_{i=1}^{n} \alpha_i \hat{m}_i w_i}{C(M)},$$

where $\alpha_i = |a|_{m_i}$ and $\hat{m}_i = M/m_i$.

Proof. Since $[a/m_i] = (a - \alpha_i)/m_i$ and $C(M) = \sum_i w_i M/m_i$,

$$C(a) = \sum_{i=1}^{n} w_i \frac{a - \alpha_i}{m_i} = \frac{a \cdot C(M)}{M} - \sum_{i=1}^{n} \frac{w_i \alpha_i}{m_i},$$

the result follows by solving for $a$.
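The decoding formula of Theorem 1 can be checked numerically. The following is a minimal sketch, assuming the example moduli {7, 9, 11} and coefficients {-1, -1, 3}; the function names are hypothetical, and the core is obtained here from the definition purely to supply an input to the formula.

```python
from math import prod

def core(a, moduli, weights):
    """C(a) from the definition: sum_i w_i * floor(a / m_i)."""
    return sum(w * (a // m) for m, w in zip(moduli, weights))

def decode_with_core(residues, core_value, moduli, weights):
    """Recover a from its residues and its core via Theorem 1 (requires C(M) != 0)."""
    M = prod(moduli)
    core_of_M = sum(w * (M // m) for m, w in zip(moduli, weights))
    numerator = M * core_value + sum(
        r * (M // m) * w for r, m, w in zip(residues, moduli, weights))
    quotient, remainder = divmod(numerator, core_of_M)
    if remainder != 0:
        raise ValueError("core value is inconsistent with the residues")
    return quotient

moduli, weights = (7, 9, 11), (-1, -1, 3)
a = 22
residues = tuple(a % m for m in moduli)                 # (1, 4, 0)
decoded = decode_with_core(residues, core(a, moduli, weights), moduli, weights)
```

Note that no reduction modulo $M$ appears; the only division is the exact division by $C(M)$.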

The nonlinearity of the core arises from the truncation of the greatest integer function. The next
theorem provides a measure of how close the core function is to being an additive homomorphism.
This theorem provides the basis for many of the ensuing results.

Theorem 2 (Akushskii et al.). If $a \equiv (\alpha_i) \pmod{M}$ and $b \equiv (\beta_i) \pmod{M}$ are integers (not necessarily
restricted to $[0, M)$), then

$$C(a + b) = C(a) + C(b) + \sum_{i=1}^{n} w_i e_i,$$

where $e_i = [(\alpha_i + \beta_i)/m_i] = 0$ or $1$.
critical core is found. The lift stage then reverses the steps of the first stage, by removing the critical
aspect of the core. This stage assumes core functions satisfy some degree of separability.

Let $a$ be an integer with critical core. In the descent stage, a modulus is discarded and the core of $a$ is
computed relative to the reduced moduli set, using a new set of coefficients $w_i$ pre-selected for the
reduced moduli set. If this core is critical, another modulus is discarded and the step repeated. The
descent is terminated when a non-critical core is found. The descent is guaranteed to terminate at or
before the stage at which the reduced moduli set has only two moduli, since $w_1$ and $w_2$ can be chosen
so that no critical cores are present (e.g., $w_1 = 1$ and $w_2 = 1$, or $w_1 = 0$ and $w_2 = 1$).

The method of lifting appends iteratively the previously discarded moduli and determines the true
core at each stage. Assume the core is not critical for the residue encoded integer $(\alpha_1, \ldots, \alpha_i)$ relative
to the moduli set $\{m_1, \ldots, m_i\}$. With Theorem 9, this known core is used to extend this integer relative to
the new basis element $m_{i+1}$. The resulting residual $\beta$ must satisfy

$$\beta + k \cdot (m_1 \cdots m_i) \equiv \alpha_{i+1} \pmod{m_{i+1}}$$

for some integer $k \in [0, m_{i+1})$. If the interval $[0, m_1 \cdots m_{i+1})$ is partitioned into $m_{i+1}$ consecutive
intervals containing $m_1 \cdots m_i$ integers each, $k$ determines the number of the interval containing
$(\alpha_1, \ldots, \alpha_{i+1})$. Thus the true core of $(\alpha_1, \ldots, \alpha_{i+1})$ relative to $m_1, \ldots, m_{i+1}$ can be determined if the core
function is $m_{i+1}$-separable. If the core function is relatively linear the process is reduced to a simple
comparison: if $k$ is less than a predetermined constant $K$ then the true core is that which is close to
zero, otherwise the true core is that which is close to the core at the product $m_1 \cdots m_{i+1}$. In the
particular case where the constant $K$ is equal to $[m_{i+1}/2]$, this process reduces to the one presented
in [2].
core is relatively linear and $C(M)$ is small, the range of $C$ must be small. Experiments conducted by
the authors indicate that compactness and linearity are opposing traits.

Akushskii et al. introduced the method of descent and lift to determine the true value of a critical
core. Both this method, and the redundant modulus core calculation (introduced by the authors in
Section IV), assume that the core function can separate integers with cores that are different but
equal modulo $C(M)$. Akushskii et al. tacitly assume that if $a$ and $b$ are two integers in $[0, M)$ with
$C(b) = C(a) + C(M)$, then $a < M/2$ and $b > M/2$. This condition, though sufficient to solve the
problem of critical cores, may be too restrictive when considering the selection of weights $w_i$, cf.
Section V.

A  weaker  concept  is  introduced,  called  m-separability.  This  notion  resolves  core  ambiguities  and  will 
be  used  in  Section  V  in  the  selection  of  core  functions. 

Definition. Let $s \ge 1$ be a real number. A core function $C(\cdot)$ is said to be s-separable if

$$|a - b| < \frac{M}{s} \quad \text{implies} \quad |C(a) - C(b)| < C(M).$$

If a core function is m-separable, assume its range $[0, M)$ is divided into $m$ consecutive disjoint
subintervals, say $I_1, \ldots, I_m$, each of length $M/m$. Within each of these, cores are unambiguously
defined; i.e., if $a$ and $b$ are in $I_j$, and $C(a) \equiv C(b) \pmod{C(M)}$, then $C(a) = C(b)$. Thus for $a \in I_j$, the
mapping $|C(a)|_{C(M)} \to C(a)$ is well defined and can be implemented by table lookup.
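The table lookup implied by m-separability can be made concrete with a brute-force sketch. It is an illustration under stated assumptions: the core has $C(M) > 0$, integers are restricted to $[0, M)$, and the names `is_separable` and `lookup_tables` are hypothetical. It is only practical for small moduli sets.

```python
from math import prod

def core(a, moduli, weights):
    """C(a) from the definition: sum_i w_i * floor(a / m_i)."""
    return sum(w * (a // m) for m, w in zip(moduli, weights))

def is_separable(s, moduli, weights):
    """Brute-force check on [0, M): |a - b| < M/s implies |C(a) - C(b)| < C(M)."""
    M = prod(moduli)
    core_of_M = core(M, moduli, weights)
    cores = [core(a, moduli, weights) for a in range(M)]
    for a in range(M):
        b = a + 1
        while b < M and (b - a) * s < M:        # all b with 0 < b - a < M/s
            if abs(cores[b] - cores[a]) >= core_of_M:
                return False
            b += 1
    return True

def lookup_tables(m, moduli, weights):
    """One table per subinterval I_j, mapping |C(a)|_{C(M)} to the true C(a)."""
    M = prod(moduli)
    core_of_M = core(M, moduli, weights)
    tables = [{} for _ in range(m)]
    for a in range(M):
        j = a * m // M                          # index of the subinterval containing a
        c = core(a, moduli, weights)
        if tables[j].setdefault(c % core_of_M, c) != c:
            raise ValueError("two cores in one subinterval collide modulo C(M)")
    return tables
```

If `is_separable(m, ...)` holds, `lookup_tables(m, ...)` cannot raise, since any two integers in the same subinterval differ by less than $M/m$.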

The method given by Akushskii et al. consists of two stages: descent and lift. The descent stage
iteratively  discards  a  modulus  and  evaluates  a  core  function  on  the  remaining  moduli  set  until  a  non- 


If $0 \le C(a) < C(M)$, (4) gives the actual value of $C(a)$; otherwise, the core has been reduced modulo
$C(M)$.

Let

$$C_{\min} = \min \{C(a) : 0 \le a < M\}, \qquad C_{\max} = \max \{C(a) : 0 \le a < M\}.$$

From Theorem 3, it follows that

$$C_{\min} + C_{\max} = C(M) - \sum_{i} w_i.$$


Furthermore,  since  C(0)  =  0, 


The range of values the core function takes in $[0, M)$ is $[C_{\min}, C_{\max}]$. If a core $|C(a)|_{C(M)}$ is calculated by (4)
and $C_{\max} - C(M) < |C(a)|_{C(M)} < C(M) + C_{\min}$, then $C(a) = |C(a)|_{C(M)}$ and thus the core has been
unambiguously determined. Otherwise, (4) yields an ambiguous result. Cores having $|C(a)|_{C(M)}$ in the
intervals $[0, C_{\max} - C(M)]$ or $[C_{\min} + C(M), C(M)]$ are called critical cores. (It should be emphasized that
the result of the evaluation of (4) is therefore self validating.) Moreover, most practical core
functions will have some critical cores.
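The self-validating test for critical cores can be sketched as follows. This is an illustrative enumeration, assuming $C(M) > 0$ and the example moduli {7, 9, 11} with coefficients {-1, -1, 3}; `classify` is a hypothetical name, and the reduced core is obtained here by reducing the true core rather than by the residue-only evaluation.

```python
from math import prod

def core(a, moduli, weights):
    """C(a) from the definition: sum_i w_i * floor(a / m_i)."""
    return sum(w * (a // m) for m, w in zip(moduli, weights))

moduli, weights = (7, 9, 11), (-1, -1, 3)
M = prod(moduli)
core_of_M = core(M, moduli, weights)
values = [core(a, moduli, weights) for a in range(M)]
c_min, c_max = min(values), max(values)

def classify(reduced):
    """Return the true core if |C(a)|_{C(M)} is unambiguous, else None (critical)."""
    if c_max - core_of_M < reduced < core_of_M + c_min:
        return reduced
    return None

# Integers in [0, M) whose reduced core falls in one of the two critical intervals.
critical = [a for a in range(M) if classify(values[a] % core_of_M) is None]
```

The length of `critical` relative to `M` gives a direct measure of how often a further disambiguation step would be needed for a given choice of coefficients.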

In  selecting  the  weights  for  a  core  function,  the  trade  between  compactness  of  range  and  linearity 
can  now  be  appreciated.  Since  (4)  involves  only  modular  arithmetic,  it  is  best  evaluated  using 
hardware  similar  to  that  for  other  residue  calculations.  Thus  C(M)  should  be  on  the  same  order  as 
the moduli of the system. On the other hand, the core should be relatively linear so that $C_{\min}$ is near 0
and $C_{\max}$ is near $C(M)$, minimizing the number of integers in $[0, M)$ having critical cores. But if the


Formula (3) follows by applying the definition of the core function (1) to $B_i$, and using the facts that
$[B_i/m_j] = B_i/m_j$ for $j \ne i$, and $[B_i/m_i] = (B_i - 1)/m_i$.


Theorem 12 (Chinese Remainder Theorem for Core Functions, Akushskii et al.). If $a = (\alpha_i)$ is an
integer in $[0, M)$, then its core is given by

$$C(a) = \sum_{i=1}^{n} \alpha_i C(B_i) - C(M) \cdot R(a).$$


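Proof. Using the rank function as given by (2), and the orthogonal core basis given in (3), the core of
$a$ is evaluated as

$$
\begin{aligned}
C(a) &= \sum_{j=1}^{n} w_j \left[ \frac{a}{m_j} \right] = \sum_{j=1}^{n} w_j \left( \frac{a - \alpha_j}{m_j} \right) \\
&= \sum_{j=1}^{n} \frac{w_j}{m_j} \left( \sum_{i=1}^{n} \alpha_i B_i - R(a) \cdot M - \alpha_j \right) \\
&= \sum_{i=1}^{n} \alpha_i \left( B_i \sum_{j=1}^{n} \frac{w_j}{m_j} - \frac{w_i}{m_i} \right) - R(a) \sum_{j=1}^{n} \frac{M w_j}{m_j} \\
&= \sum_{i=1}^{n} \alpha_i C(B_i) - C(M) \cdot R(a).
\end{aligned}
$$

In a residue implementation the core can be evaluated modulo $C(M)$ as

$$|C(a)|_{C(M)} = \left| \sum_{i=1}^{n} \alpha_i C(B_i) \right|_{C(M)}. \qquad (4)$$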
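The evaluation of the core modulo $C(M)$ directly from the residues can be sketched as below. This is an illustration, not the hardware algorithm: it assumes $C(M) > 0$ and the example moduli {7, 9, 11} with coefficients {-1, -1, 3}, builds each orthogonal basis element $B_i$ (residue 1 modulo $m_i$, residue 0 modulo the other moduli) with Python's modular inverse `pow(x, -1, m)`, and precomputes the constants $C(B_i)$ from the definition.

```python
from math import prod

def core(a, moduli, weights):
    """C(a) from the definition: sum_i w_i * floor(a / m_i)."""
    return sum(w * (a // m) for m, w in zip(moduli, weights))

def orthogonal_basis(moduli):
    """B_i in [0, M) with B_i = 1 (mod m_i) and B_i = 0 (mod m_j) for j != i."""
    M = prod(moduli)
    return [(M // m) * pow(M // m, -1, m) for m in moduli]

def core_mod_CM(residues, moduli, weights):
    """|C(a)|_{C(M)} as the residue of an inner product of residues with C(B_i)."""
    core_of_M = core(prod(moduli), moduli, weights)
    basis_cores = [core(B, moduli, weights) for B in orthogonal_basis(moduli)]
    return sum(r * c for r, c in zip(residues, basis_cores)) % core_of_M

moduli, weights = (7, 9, 11), (-1, -1, 3)
basis = orthogonal_basis(moduli)                          # [99, 154, 441]
basis_cores = [core(B, moduli, weights) for B in basis]   # [2, 3, 8]
```

In a hardware setting the `basis_cores` would be fixed constants of the RNS, so only the inner product and one reduction modulo $C(M)$ remain at run time.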


The next lemma gives an explicit representation of the elements of the orthogonal basis.


Lemma. The elements of the orthogonal basis are given by

$$B_i = \hat{m}_i \left| \frac{1}{\hat{m}_i} \right|_{m_i},$$

where $\hat{m}_i = M/m_i$.

Proof. Since $B_i \in [0, M)$ is divisible by $m_j$ for all $j \ne i$, it can be written as $B_i = k \hat{m}_i$, where $0 < k < m_i$.
Furthermore, $B_i - 1$ is divisible by $m_i$, and so $k \hat{m}_i \equiv 1 \pmod{m_i}$, or equivalently $k = |1/\hat{m}_i|_{m_i}$.

Substituting these values for the $B_i$ into Theorem 11 gives

$$a = \left| \sum_{i=1}^{n} \alpha_i \hat{m}_i \left| \frac{1}{\hat{m}_i} \right|_{m_i} \right|_M,$$

a slightly different formulation of the Chinese remainder theorem than given by Szabo and Tanaka
[4]. Note also that the rank function $R(a)$ is analogous to the function $A(a)$ defined by [4]

$$a = \sum_{i=1}^{n} \hat{m}_i \left| \frac{\alpha_i}{\hat{m}_i} \right|_{m_i} - A(a) \cdot M.$$

In general $A(a) \le R(a)$.


The values the core function assumes at the orthogonal basis elements perform the role of a basis.
This is given by the Chinese Remainder Theorem for Core Functions which expresses the core of an
integer in the range of the RNS as an inner product of its residues with the 'orthogonal core basis'
$\{C(B_i)\}$. These coefficients are constants of the RNS and are precomputed as

$$C(B_i) = \frac{B_i \cdot C(M)}{M} - \frac{w_i}{m_i}. \qquad (3)$$
III. COMPUTATION OF THE CORE FUNCTION


The definition of the core function as given in Section II, though useful for the derivation of its
properties,  is  not  efficient  for  computations  within  the  context  of  residue  arithmetic.  Its  use  requires 
decoding  of  the  residue  representation,  n  integer  divisions,  and  finally  an  inner  product  of  length  n. 

In  this  section  it  will  be  shown  how  the  core  function  can  be  evaluated  directly  from  the  residue 
representation  of  an  integer  in  the  range  of  the  RNS.  This  result,  referred  to  as  the  Chinese 
Remainder  Theorem  for  Core  Functions,  gives  core  values  modulo  the  core  of  M.  This  leads  to  the 
presence  of  some  ambiguous  cases,  called  critical  cores.  Several  methods  are  presented  that  permit 
the  unambiguous  evaluation  of  these  cores. 

Let $B_i = (\beta_j) \in [0, M)$ be defined by

$$\beta_j = \begin{cases} 1 & j = i \\ 0 & \text{otherwise.} \end{cases}$$

The set $B_1, B_2, \ldots, B_n$ is called an orthogonal basis for the residue system. The nomenclature is
justified in view of the form of the Chinese Remainder Theorem as stated below without proof.

Theorem 11 (Chinese Remainder Theorem). If $a = (\alpha_i)$, then $a = \left| \sum_i \alpha_i B_i \right|_M$.

The difference between $\sum_i \alpha_i B_i$ and $a$ is a non-negative integer multiple of $M$. This number $R(a)$ is
called the rank of $a$ and satisfies the relation

$$a = \sum_{i=1}^{n} \alpha_i B_i - R(a) \cdot M. \qquad (2)$$


$$x_j = \frac{-m_j \, C(a/m_j) + C(a) - \sum_{i \ne j} w_i \left( \dfrac{m_j x_i - \alpha_i}{m_i} \right)}{w_j}.$$

Since $0 \le x_j < m_j$, the right side may be reduced modulo $m_j$, giving the result.
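The residue $|a/m_j|_{m_j}$ given by Theorem 9 can be checked with a short sketch. It is illustrative only: it assumes the example moduli {7, 9, 11} and coefficients {-1, -1, 3} (each $w_j$ is coprime to its $m_j$), the name `scaled_residue` is hypothetical, and the core is taken from the definition to supply the input that a residue implementation would obtain from the core calculation.

```python
def core(a, moduli, weights):
    """C(a) from the definition: sum_i w_i * floor(a / m_i)."""
    return sum(w * (a // m) for m, w in zip(moduli, weights))

def scaled_residue(residues, core_value, j, moduli, weights):
    """|a / m_j|_{m_j} for a divisible by m_j (Theorem 9); needs gcd(w_j, m_j) = 1."""
    m_j = moduli[j]
    total = core_value
    for i, (m_i, w_i, r_i) in enumerate(zip(moduli, weights, residues)):
        if i != j:
            # |w_i / m_i|_{m_j} * alpha_i, left unreduced until the final step
            total += w_i * pow(m_i, -1, m_j) * r_i
    return (pow(weights[j] % m_j, -1, m_j) * total) % m_j

moduli, weights = (7, 9, 11), (-1, -1, 3)
a = 22                                              # divisible by m_3 = 11
residues = tuple(a % m for m in moduli)             # (1, 4, 0)
x_3 = scaled_residue(residues, core(a, moduli, weights), 2, moduli, weights)
```

Here `x_3` is the missing residue of $a/11$ modulo 11; the residues of $a/11$ modulo 7 and 9 follow from ordinary modular division, so the quotient is obtained without a mixed radix conversion.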


Theorem 10 (Akushskii et al.). Let $a \equiv (\alpha_i) \pmod{M}$ and let $p$ satisfy $(C(M), p) = 1$. Then

$$|a|_p = \left| \sum_{i=1}^{n} \left| \frac{w_i \hat{m}_i}{C(M)} \right|_p \alpha_i + \left| \frac{M}{C(M)} \right|_p C(a) \right|_p,$$

where $\hat{m}_i = M/m_i$.

Proof. The result follows immediately from Theorem 1 upon reduction modulo $p$.
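Theorem 10 amounts to an inner product modulo $p$ with fixed weights plus one multiple of the core. A minimal sketch follows, assuming the example moduli {7, 9, 11} and coefficients {-1, -1, 3}, for which $C(M) = 13$; the function name `residue_mod_p` is hypothetical and the core is again taken from the definition for illustration.

```python
from math import prod

def core(a, moduli, weights):
    """C(a) from the definition: sum_i w_i * floor(a / m_i)."""
    return sum(w * (a // m) for m, w in zip(moduli, weights))

def residue_mod_p(residues, core_value, p, moduli, weights):
    """|a|_p from the residues and core of a (Theorem 10); needs gcd(C(M), p) = 1."""
    M = prod(moduli)
    core_of_M = core(M, moduli, weights)
    inv = pow(core_of_M % p, -1, p)                   # |1 / C(M)|_p
    total = M * inv * core_value                      # |M / C(M)|_p * C(a)
    for r, m, w in zip(residues, moduli, weights):
        total += w * (M // m) * inv * r               # |w_i m^_i / C(M)|_p * alpha_i
    return total % p

moduli, weights = (7, 9, 11), (-1, -1, 3)
a = 500
residues = tuple(a % m for m in moduli)
a_mod_16 = residue_mod_p(residues, core(a, moduli, weights), 16, moduli, weights)
```

Since $p$ need not be one of the moduli, the same routine serves as an extension of base to an arbitrary modulus coprime to $C(M)$.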

Theorem 10 provides a new approach to general scaling. If $a = (\alpha_i)$ is to be scaled by $p$, then $a/p$ is
computed as

$$\left| \frac{a - |a|_p}{p} \right|_{m_i} = \cdots$$


each involve modular calculations consisting of a core function evaluation and an inner product of
$(\alpha_i)$ with a fixed weight vector.


We begin with the following lemma.


Lemma. If $a = (\alpha_i)$ is divisible by $p$ and $a/p = (x_i)$, then

$$C(a/p) = \frac{C(a) - \sum_{i} w_i \left( \dfrac{p x_i - \alpha_i}{m_i} \right)}{p}.$$

Proof. Since $a = a/p + \cdots + a/p$ ($p$ summands), by iteration of Theorem 2,

$$C(a) = p \cdot C(a/p) + \sum_{i} w_i \left( \frac{p x_i - \alpha_i}{m_i} \right).$$

The result follows by solving for $C(a/p)$.


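Theorem 9 (Akushskii et al.). Let $a = (\alpha_i)$ be divisible by $m_j$ (i.e., $\alpha_j = 0$) and let $(w_j, m_j) = 1$. Then

$$\left| \frac{a}{m_j} \right|_{m_j} = \left| \left| \frac{1}{w_j} \right|_{m_j} \left( C(a) + \sum_{i \ne j} \left| \frac{w_i}{m_i} \right|_{m_j} \alpha_i \right) \right|_{m_j}.$$

Proof. Since $\alpha_j = 0$, applying the lemma with $p = m_j$ gives

$$C(a/m_j) = \frac{C(a) - \sum_{i} w_i \left( \dfrac{m_j x_i - \alpha_i}{m_i} \right)}{m_j},$$

where $(x_i)$ is the RNS encoding of $a/m_j$. Thus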

Since $C(M) \ne 0$, $C(c) \ne C(a) + C(b) + \sum_i w_i e_i$.

The  following  theorem  gives  a  test  for  sign  detection.  The  theorem  does  not  directly  appear  in  the 
works  of  Akushskii  et  al.,  but  is  a  useful  and  obvious  consequence  of  Theorem  6.  As  usual  in  a  signed 
RNS,  the  interval  (0,M/2)  represents  positive  numbers,  and  the  interval  (M/2,M)  represents  negative 
numbers. 

Theorem 8. Let the moduli $m_i$ and the core $C(M)$ be odd, and let $(\alpha_i)$ be the residue representation of
a non-zero integer $a \in (0, M)$. Then $a$ represents a positive integer if and only if

$$(|2\alpha_1|_{m_1}, \ldots, |2\alpha_n|_{m_n})$$

is even.

Corollary. Let the moduli $m_i$ and the core $C(M)$ be odd, and let $a \ne b$ be integers of the same sign in
$(0, M)$ with residue representations $(\alpha_i)$ and $(\beta_i)$. Then $b > a$ if and only if

$$(|2(\beta_1 - \alpha_1)|_{m_1}, \ldots, |2(\beta_n - \alpha_n)|_{m_n})$$

is even.
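The logic of the sign test and of the comparison corollary can be checked by brute force. The sketch below is an illustration under an explicit simplification: parity is obtained by fully decoding the residues with the Chinese remainder theorem (the hypothetical helper `decode`), whereas the approach of the text obtains parity from the core. Odd moduli {7, 9, 11} are assumed.

```python
from math import prod

def decode(residues, moduli):
    """Chinese remainder decoding; stands in here for a core-based parity test."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

def is_positive(residues, moduli):
    """Theorem 8: a in (0, M) is positive iff the doubled residue tuple is even."""
    doubled = tuple((2 * r) % m for r, m in zip(residues, moduli))
    return decode(doubled, moduli) % 2 == 0

def is_greater(res_b, res_a, moduli):
    """Corollary: for a != b of the same sign, b > a iff 2(b - a) mod M is even."""
    diff = tuple((2 * (rb - ra)) % m for rb, ra, m in zip(res_b, res_a, moduli))
    return decode(diff, moduli) % 2 == 0

moduli = (7, 9, 11)
M = prod(moduli)
encode = lambda x: tuple(x % m for m in moduli)
```

Doubling $a$ overflows the odd range $M$ exactly when $a > M/2$, and an overflow flips the parity of the result; that is the whole content of the test.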

The  next  two  theorems  provide  formulas  for  scaling  a  residue  number  by  a  modulus,  and  extension 
of  base.  Scaling  by  a  modulus  (or  a  product  of  several  moduli)  has  been  considerably  simpler  than 
general  scaling,  and  the  primary  computational  complexity  has  been  the  extension  of  base  relative 
to  the  modulus  (moduli)  used  for  scaling,  fundamentally  a  mixed  radix  conversion  process.  A  mixed 
radix  conversion  relative  to  n  moduli  is  performed  in  n-1  stages  and  no  general  techniques  are 
known  for  collapsing  the  process  into  fewer  stages.  The  algorithms  given  in  the  following  theorems 


Theorem 6 (Akushskii et al.). Let the moduli $m_i$ be odd and let $(\alpha_i)$ and $(\beta_i)$ be the residue
representations of integers $a, b \in [0, M)$. Then $a + b$ overflows the system if and only if


i) $(|\alpha_i + \beta_i|_{m_i})$ is odd and $a$ and $b$ have the same parity; or


ii) $(|\alpha_i + \beta_i|_{m_i})$ is even and $a$ and $b$ have differing parity.
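The two cases of Theorem 6 reduce to a single parity comparison, sketched below. As an explicit simplification, parity is found by decoding the residues (hypothetical helper `decode`); in the setting of the text it would come from the core-based parity theorem. Odd moduli {7, 9, 11} are assumed.

```python
from math import prod

def decode(residues, moduli):
    """Chinese remainder decoding; stands in here for a core-based parity test."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

def parity(residues, moduli):
    return decode(residues, moduli) % 2

def addition_overflows(res_a, res_b, moduli):
    """Theorem 6: with odd moduli, overflow flips the expected parity of the sum."""
    res_c = tuple((x + y) % m for x, y, m in zip(res_a, res_b, moduli))
    same_parity = parity(res_a, moduli) == parity(res_b, moduli)
    sum_is_odd = parity(res_c, moduli) == 1
    # case i): odd sum from equal parities; case ii): even sum from differing parities
    return (sum_is_odd and same_parity) or (not sum_is_odd and not same_parity)

moduli = (7, 9, 11)
M = prod(moduli)
encode = lambda x: tuple(x % m for m in moduli)
```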


The second method for overflow detection under addition requires the computation of the core of the "sum" in two ways: overflow occurs when the results differ. This theorem does not require that the moduli be odd.

Theorem 7 (Akushskii et al.). Suppose $C(M) \ne 0$ and let $a, b, c \in [0,M)$ be given by $a = (a_i)$, $b = (\beta_i)$, $c = (|a_i + \beta_i|_{m_i})$. Then the sum $a + b < M$ if and only if

$$C(c) = C(a) + C(b) + \sum_i w_i \varepsilon_i ,$$

where

$$\varepsilon_i = \left\lfloor \frac{a_i + \beta_i}{m_i} \right\rfloor = 0 \text{ or } 1 .$$


Proof. Clearly $a + b < M$ if and only if $a + b = c$, so the forward direction follows immediately from Theorem 2. For the converse, assume $a + b \ge M$, so that $c = a + b - M$. Then by Theorem 2 and its corollaries

$$C(c) = C(a + b - M) = C(a + b) - C(M) = C(a) + C(b) + \sum_i w_i \varepsilon_i - C(M).$$

Since $C(M) \ne 0$, the equality of the theorem fails in this case.
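The test of Theorem 7 can be checked numerically. The following Python sketch is illustrative only: it assumes the moduli {7, 9, 11} and weights {-1, -1, 3} of the worked examples, evaluates the core directly from its definition on decoded integers (a hardware implementation would work from residues), and uses hypothetical function names.

```python
MODULI = (7, 9, 11)      # example moduli, M = 693 (assumption: the paper's example set)
WEIGHTS = (-1, -1, 3)    # example core coefficients, C(M) = 13
M = 7 * 9 * 11


def core(n):
    """C(n) = sum_i w_i * floor(n / m_i), straight from the definition."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


def carry_term(a, b):
    """sum_i w_i * eps_i with eps_i = floor((a_i + beta_i) / m_i), each 0 or 1."""
    return sum(w * ((a % m + b % m) // m) for w, m in zip(WEIGHTS, MODULI))


def sum_fits(a, b):
    """Theorem 7: a + b < M iff C(c) = C(a) + C(b) + sum_i w_i * eps_i."""
    c = (a + b) % M  # the integer whose residues are |a_i + beta_i|_{m_i}
    return core(c) == core(a) + core(b) + carry_term(a, b)


print(sum_fits(100, 125))  # 225 < 693, so the two core computations agree
print(sum_fits(500, 500))  # 1000 >= 693, so they differ by C(M)
```

The two computations of the core differ by exactly C(M) when the sum wraps around, which is why the hypothesis C(M) ≠ 0 is needed.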



have  the  same  parity. 


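Proof. Let

$$\varepsilon_i = \left\lfloor \frac{2\beta_i}{m_i} \right\rfloor ,$$

so that $\varepsilon_1 = \beta_1$. By Theorem 2,

$$C(a) = C\!\left(\frac{a}{2} + \frac{a}{2}\right) = 2\,C\!\left(\frac{a}{2}\right) + \sum_{i=1}^{n} w_i \varepsilon_i ,$$

so that

$$C(a/2) = \frac{C(a) - \sum_{i=2}^{n} w_i \varepsilon_i - w_1 \beta_1}{2} .$$

From the definition of $\varepsilon_i$, $a_i = 2\beta_i - \varepsilon_i m_i$. Since for $i > 1$, $m_i$ is odd, $\varepsilon_i = 0$ if $a_i$ is even and $\varepsilon_i = 1$ if $a_i$ is odd. Thus $\varepsilon_i = \delta_i$ is the parity function of $a_i$ for $i = 2, \ldots, n$. Thus if $\beta_1 = 0$, $C(a)$ and

$$\sum_{i=2}^{n} w_i \delta_i$$

have the same parity, and if $\beta_1 = 1$, they have differing parity since $w_1$ is odd.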

Two theorems are given next for the detection of overflow under the addition of two numbers. The first uses the parity information from Theorem 5 as follows. If all $m_i$ (hence $M$) are odd and a sum $a + b$ is formed (where $0 \le a, b < M$), overflow occurs when $a + b \ge M$, and the resulting residue representation actually equals $a + b - M$. Hence overflow occurs exactly when the parity of the result modulo $M$ is different from the parity expected, based on knowledge of the parity of $a$ and $b$.


Proof. From Theorem 1, $C(M) \cdot a = M \cdot C(a) + \sum a_i w_i (M/m_i)$. Since the moduli $m_i$ and $C(M)$ are odd, $a$ and $C(M) \cdot a$ have the same parity, $C(a)$ and $M \cdot C(a)$ have the same parity, and $\sum a_i w_i$ and $\sum a_i w_i (M/m_i)$ have the same parity. Since a sum of two terms is even if and only if the summands have the same parity, the result follows.

A counterexample to the above theorem is next given in the case that the hypothesis that $C(M)$ be odd is dropped. Specifically, we show that if $C(M)$ is even, $C(a)$ and $\sum w_i \delta_i$ can have the same parity for odd $a$. This establishes the necessity of the hypothesis omitted in [1] and the error in the proof therein. Let $\{m_1, m_2\} = \{3, 5\}$ and let $\{w_1, w_2\} = \{1, -1\}$. Then $C(15) = 2$ and $C(5) = 0$. The odd integer 5 is encoded as (2,0), and since both components are even, $\sum w_i \delta_i = 0$, completing the example.

Among other things, Theorem 4 shows that in the case where all the moduli are odd, scaling by 2 can be performed by checking the parity of $a = (a_i)$, subtracting 1 from each $a_i$ if the parity is odd, and multiplying each $a_i$ by the modulo $m_i$ multiplicative inverse of 2. The computational requirements for checking parity thus consist of a core calculation, parity checks on the $a_i$, and the summation $\sum w_i \delta_i$.


In the case that one of the moduli is even, the parity of $a$ is known immediately. Scaling by two for the odd moduli components is trivial, but calculation of the scaled result for the even modulus is normally performed by an extension of base. The next theorem uses the core to compute the residue of $a/2$ relative to the even modulus. Without loss of generality, the even modulus is assumed to be 2. (This theorem is actually a special case of Theorem 9 below.)

Theorem 5 (Akushskii et al.). Let $m_1 = 2$, let $a = (a_i)$ be even (so $a_1 = 0$), and let $w_1$ be odd. Let $a/2 = (\beta_i)$ and let $\delta_i$ be the parity function on $a_i$. Then $\beta_1 = 0$ if and only if $C(a)$ and $\sum_{i=2}^{n} w_i \delta_i$ have the same parity.

Theorem 3 (Akushskii et al.). $C(M - a - 1) = (C(M) - \sum w_i) - C(a)$.


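Proof. Write $a = x_i m_i + a_i$ for each $i = 1, \ldots, n$, where $0 \le a_i < m_i$. Then

$$M - a - 1 = \left(\frac{M}{m_i} - x_i - 1\right) m_i + (m_i - a_i - 1),$$

and

$$0 \le m_i - a_i - 1 < m_i \quad \text{for } i = 1, \ldots, n.$$

Thus

$$C(M - a - 1) = \sum_i w_i \left\lfloor \frac{M - a - 1}{m_i} \right\rfloor = \sum_i w_i \left(\frac{M}{m_i} - x_i - 1\right) = C(M) - C(a) - \sum_i w_i .$$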


The  next  theorem  characterizes  the  parity  of  a  residue  encoded  number  in  the  case  that  all  the 
moduli  are  odd.  This  theorem  provides  the  basis  for  sign  detection,  magnitude  comparison,  and 
detection  of  overflow  under  addition.  It  should  be  noted  that  both  the  hypotheses  and  the  proof  of 
the  converse  direction  of  the  equivalence  are  incorrect  in  [1],  where  the  statement  of  the  theorem 
does  not  require  that  C(M )  be  odd.  (A  counterexample  in  the  case  that  C(M)  is  even  is  given  below.) 


Theorem 4 (Akushskii et al.). Assume the moduli $m_i$ are odd and the core coefficients $w_i$ are chosen so that the core $C(M)$ is odd. Let $a = (a_i) \pmod M$, and let $\delta_i$ be the parity function on $a_i$ (i.e., $\delta_i = 0$ if $a_i$ is even, $\delta_i = 1$ if $a_i$ is odd). Then $a$ is even if and only if $C(a)$ and $\sum w_i \delta_i$ have the same parity.


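Proof.

$$C(a + b) = \sum_i w_i \left\lfloor \frac{a + b}{m_i} \right\rfloor = \sum_i w_i \left( \frac{a + b - a_i - \beta_i + \varepsilon_i m_i}{m_i} \right)$$

$$= \sum_i w_i \left( \frac{a - a_i}{m_i} \right) + \sum_i w_i \left( \frac{b - \beta_i}{m_i} \right) + \sum_i w_i \varepsilon_i$$

$$= C(a) + C(b) + \sum_i w_i \varepsilon_i .$$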


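Corollary. $C(-a) = -C(a) + \sum w_i \varepsilon_i$, where

$$\varepsilon_i = \left\lfloor \frac{-a_i}{m_i} \right\rfloor = -1 \text{ or } 0.$$

Corollary. $C(a - b) = C(a) - C(b) + \sum w_i \varepsilon_i$, where

$$\varepsilon_i = \left\lfloor \frac{a_i - \beta_i}{m_i} \right\rfloor = -1 \text{ or } 0.$$

Corollary. If $a_i \cdot \beta_i = 0$ for $i = 1, \ldots, n$, then $C(a + b) = C(a) + C(b)$. In particular, $C(M/m_i + M/m_j) = C(M/m_i) + C(M/m_j)$ for $1 \le i \ne j \le n$.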


The next result characterizes the symmetry of the core and can be interpreted as saying that the core is approximately odd in $[0,M)$ with respect to the midpoint of the interval. This theorem will be used to derive a bound on the range of $C(\cdot)$ on $[0,M)$, important for the selection of the $w_i$.


The following examples are given to illustrate the techniques of Akushskii et al. The moduli set chosen for all examples is $m_1 = 7$, $m_2 = 9$, $m_3 = 11$, and thus $M = 693$. The coefficients of the core function are $w_1 = -1$, $w_2 = -1$, and $w_3 = 3$. The orthogonal basis elements are $B_1 = 99$, $B_2 = 154$, and $B_3 = 441$. The cores $C(M) = 13$, and $C(B_1) = 2$, $C(B_2) = 3$, and $C(B_3) = 8$ are computed from the definition (1). Using (4), the core of $a = (a_1, a_2, a_3)$ is given by

$$|C(a)|_{13} = |2a_1 + 3a_2 + 8a_3|_{13} \qquad (6)$$

For $0 \le a < 693$, the minimum core is $C_{\min} = -2$, and the maximum core is $C_{\max} = 14$. Hence the only critical cores as computed by (6) are $|C(a)|_{13} = 0$, 1, 11 or 12.
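The constants quoted for this example set can be reproduced with a few lines of Python. This is a minimal sketch, assuming the core is defined as $C(n) = \sum_i w_i \lfloor n/m_i \rfloor$ and that the orthogonal basis element $B_i$ is 1 modulo $m_i$ and 0 modulo the other moduli; the function names are hypothetical.

```python
from math import prod

MODULI = (7, 9, 11)      # m_1, m_2, m_3 of the examples
WEIGHTS = (-1, -1, 3)    # w_1, w_2, w_3 of the examples
M = prod(MODULI)         # 693


def core(n):
    """Core from the definition: C(n) = sum_i w_i * floor(n / m_i)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


def orthogonal_basis():
    """B_i = (M/m_i) * |(M/m_i)^(-1)|_{m_i}: 1 modulo m_i, 0 modulo the rest."""
    basis = []
    for m in MODULI:
        m_hat = M // m
        basis.append(m_hat * pow(m_hat, -1, m))  # modular inverse, Python 3.8+
    return basis


def core_mod_cm(residues):
    """Equation (6): |C(a)|_{C(M)} = |sum_i a_i * C(B_i)|_{C(M)}."""
    c_m = core(M)
    return sum(a * core(b) for a, b in zip(residues, orthogonal_basis())) % c_m


print(core(M), orthogonal_basis(), [core(b) for b in orthogonal_basis()])
cores = [core(n) for n in range(M)]
print(min(cores), max(cores))
```

Equation (6) only yields the core modulo $C(M) = 13$, which is why residues that correspond to two different true cores in $[C_{\min}, C_{\max}]$ are critical.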

Example 1. Determine the parity of $a = (2,1,1)$.

From (6), $|C(a)|_{13} = 2$. Since this is a non-critical core it follows that $C(a) = 2$. The sum $\sum w_i \delta_i = 2$, where $\delta_i$ is the parity of $a_i$. Since $\sum w_i \delta_i$ and $C(a)$ have the same parity it follows from Theorem 4 that $a$ is even. (The decoded value of $a$ is 100.)

Example 2. Determine the sign of $a = (3,5,5)$.

Let $|2a|_M = (|2 \cdot 3|_7, |2 \cdot 5|_9, |2 \cdot 5|_{11}) = (6,1,10)$. The core of $|2a|_M$ is 4. The sum $\sum w_i \delta_i$, where $\delta_i$ is the parity of $|2a_i|_{m_i}$, is odd. Hence $|2a|_M$ is odd and so $a$ represents a negative integer. (The decoded value of $a$ is $500 \in (M/2, M)$, representing the $M$-complemented integer $a - M = -193$.)
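Examples 1 and 2 can be replayed in code. The sketch below is an illustration under stated assumptions: it uses the example moduli and weights, and it obtains the core by first decoding the residues with the Chinese remainder theorem, which is a shortcut for demonstration only and not how the core would be evaluated in an RNS processor. All function names are hypothetical.

```python
MODULI = (7, 9, 11)
WEIGHTS = (-1, -1, 3)
M = 7 * 9 * 11


def core(n):
    """C(n) = sum_i w_i * floor(n / m_i)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


def decode(residues):
    """Chinese remainder decoding, used here only to obtain the core."""
    total = 0
    for a, m in zip(residues, MODULI):
        m_hat = M // m
        total += a * m_hat * pow(m_hat, -1, m)
    return total % M


def is_even(residues):
    """Theorem 4: a is even iff C(a) and sum_i w_i * delta_i share parity."""
    c = core(decode(residues))
    s = sum(w * (a % 2) for w, a in zip(WEIGHTS, residues))
    return (c - s) % 2 == 0


def is_positive(residues):
    """Theorem 8: a lies in (0, M/2) iff the number (|2 a_i|_{m_i}) is even."""
    doubled = tuple((2 * a) % m for a, m in zip(residues, MODULI))
    return is_even(doubled)


print(is_even((2, 1, 1)))      # Example 1: a = 100
print(is_positive((3, 5, 5)))  # Example 2: a = 500 represents -193
```

Both tests rely on the moduli and $C(M)$ being odd, as the theorems require.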

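Example 3. Extend $a = (6,8,4)$ to base $p = 5$.

From (6), $C(a) = 3$. Then using Theorem 10

$$|a|_5 = \left|\, \left|\frac{(-1) \cdot 99}{13}\right|_5 \cdot 6 + \left|\frac{(-1) \cdot 77}{13}\right|_5 \cdot 8 + \left|\frac{3 \cdot 63}{13}\right|_5 \cdot 4 + \left|\frac{693}{13}\right|_5 \cdot 3 \,\right|_5 .$$

(The decoded value of $a$ is 125.)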
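The arithmetic of Example 3 can be followed in a short Python sketch. Theorem 10 is not restated here, so the code assumes it has the form obtained by reducing the identity $a \cdot C(M) = M \cdot C(a) + \sum_i w_i (M/m_i) a_i$ modulo $p$, which matches the four terms of the example; the function name `extend_base` is hypothetical.

```python
MODULI = (7, 9, 11)
WEIGHTS = (-1, -1, 3)
M = 7 * 9 * 11
C_M = sum(w * (M // m) for w, m in zip(WEIGHTS, MODULI))  # C(M) = 13


def core(n):
    """C(n) = sum_i w_i * floor(n / m_i)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


def extend_base(residues, c, p):
    """Residue of a modulo p from the residues a_i and the core c = C(a).

    Assumed form: |a|_p = | sum_i |w_i (M/m_i) / C(M)|_p * a_i
                            + |M / C(M)|_p * C(a) |_p,
    which requires gcd(C(M), p) = 1.
    """
    inv_cm = pow(C_M, -1, p)  # |1 / C(M)|_p
    coeffs = [(w * (M // m) * inv_cm) % p for w, m in zip(WEIGHTS, MODULI)]
    core_coeff = (M * inv_cm) % p
    return (sum(k * a for k, a in zip(coeffs, residues)) + core_coeff * c) % p


print(extend_base((6, 8, 4), 3, 5))  # residue of a = 125 modulo 5
```

All four coefficients depend only on the moduli, the weights and $p$, so they can be tabulated in advance.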

Example 4. Compute the core of $a = (1,8,8)$.

From (6), $|C(a)|_{13} = 12$, a critical core. The method of descent and lifting is applied.

Descent. The core of (1,8) relative to $m_1 = 7$, $m_2 = 9$ is computed for the core coefficients $w_1 = w_2 = 1$. The orthogonal basis elements are computed as $B_1 = 36$, $B_2 = 28$ with cores $C(B_1) = 9$, $C(B_2) = 7$, and $C(M) = 16$. Then

$$C(1,8) = |9 \cdot 1 + 7 \cdot 8|_{16} = 1 .$$

This is not a critical core since the 2-moduli system has no critical cores.

Lifting. Theorem 10 is applied with $p = 11$, observing that $|1/C(M)|_{11} = |1/16|_{11} = 9$. Then $\beta$, the extension of (1,8) to the base $p = 11$, is given by

$$\beta = \big|\, |1 \cdot 9 \cdot 9|_{11} \cdot 1 + |1 \cdot 7 \cdot 9|_{11} \cdot 8 + |63 \cdot 9|_{11} \cdot 1 \,\big|_{11} = |4 \cdot 1 + 8 \cdot 8 + 6 \cdot 1|_{11} = 8 .$$

Equation (5) is then solved to obtain $k = 0$. Since $C$ is 11-separable, this implies that $a = (1,8,8)$ is in the first subinterval of $[0,693)$ and hence $C(a) = |C(a)|_{C(M)} - C(M) = 12 - 13 = -1$. (The decoded value of $a$ is 8, the core of which can be directly computed from the definition (1), yielding the same result.)


IV.  REDUNDANT  MODULUS  CORE  CALCULATION 


The  use  of  the  core  function  for  performing  the  difficult  RNS  operations  is  practical  only  if  the  core 
function  can  be  efficiently  evaluated.  While  the  method  of  Akushskii  et  al.  is  efficient  for  non-critical 
cores,  the  methods  for  the  unambiguous  evaluation  of  critical  cores  can  be  quite  cumbersome.  In 
particular,  critical  core  evaluation  could  not  be  embedded  in  a  flow-through  architecture  since  the 
length  of  the  computational  path  depends  upon  intermediate  results.  In  this  section  we  introduce  a 
redundant  modulus  which  eliminates  critical  cores  altogether.  Calculations  modulo  this  redundant 
modulus  must  be  performed  for  all  computations  upstream  of  a  required  core  evaluation,  so  that 
when  the  core  evaluation  C(a)  is  required,  the  residue  of  a  modulo  the  redundant  modulus  is  known. 
The  core  evaluation  is  then  simply  an  inner  product  of  the  residue  components  of  a  with  fixed 
weights,  where  the  arithmetic  is  performed  modulo  the  redundant  modulus. 

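Let $C_{\min}$ and $C_{\max}$ be the minimum and maximum cores on $[0, M)$ as before, and choose a new, redundant modulus $m_{n+1} > C_{\max} - C_{\min}$ with $(m_{n+1}, m_i) = 1$ for $i = 1, \ldots, n$. Then for $a \in [0, M)$, $|C(a)|_{m_{n+1}}$ will uniquely determine $C(a)$ by

$$C(a) = \begin{cases} |C(a)|_{m_{n+1}} - m_{n+1} & \text{if } |C(a)|_{m_{n+1}} \ge m_{n+1} + C_{\min}, \\ |C(a)|_{m_{n+1}} & \text{otherwise.} \end{cases}$$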

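As shown in Theorem 1,

$$M \cdot C(a) = \sum_{i=1}^{n} a_i (-w_i) \frac{M}{m_i} + a \cdot C(M).$$

Solving this equation for $C(a)$ in the ring of integers modulo $m_{n+1}$, it follows that

$$|C(a)|_{m_{n+1}} = \left|\, \sum_{i=1}^{n} \left| \frac{(-w_i)\, M/m_i}{M} \right|_{m_{n+1}} a_i + \left| \frac{C(M)}{M} \right|_{m_{n+1}} a_{n+1} \,\right|_{m_{n+1}} . \qquad (7)$$

All coefficients can be precomputed, and thus the core is calculated as a modulo $m_{n+1}$ inner product of $(a_1, \ldots, a_{n+1})$ with a fixed vector.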

Next note that the extended RNS on $\{m_1, \ldots, m_{n+1}\}$ uniquely represents integers in the interval $[0, M m_{n+1})$. We emphasize that (7) is guaranteed to uniquely determine $C(a)$ only if $a = (a_1, \ldots, a_n, a_{n+1}) \in [0, M)$. By choosing the redundant modulus to be larger than the length of the range interval of $C$ on an expanded domain (e.g., $[0, 2M)$), all cores of integers in this new domain can be unambiguously computed as well.

With the provision that $a \in [0, M)$, redundant modulus core calculation can be used directly in the algorithms for decoding (Theorem 1), parity determination (Theorems 4 and 5), scaling by a modulus (Theorem 9), and extension of base (Theorem 10), and highly non-linear cores will suffice for these applications. However, Theorems 6 and 8 (overflow under addition and sign detection) require the capability to compute $C(|a|_M)$ for $a \notin [0, M)$, as they take advantage of parity changes upon wraparound modulo $M$. In the redundant RNS, wraparound will not occur under the addition of two integers $a, b \in [0, M)$, and the result will be uniquely represented as an integer in $[0, 2M) \subset [0, M m_{n+1})$. New algorithms are formulated below for applying the redundant modulus core calculation to these problems. A relatively linear core function is required for practical implementation.

Let $C_{\min}$ and $C_{\max}$ be as above and let $C^*_{\min}$ and $C^*_{\max}$ be the minimum and maximum values of $C$ on $[M, 2M)$, respectively. (Thus $C^*_{\min} = C_{\min} + C(M)$ and $C^*_{\max} = C_{\max} + C(M)$.) The proof of the following theorem is trivial.


Theorem 13. Let $a, b \in [0, M)$. Then $a + b$:

i) overflows $[0, M)$ if $C(a + b) > C_{\max}$, and

ii) does not overflow $[0, M)$ if $C(a + b) < C^*_{\min}$.

If $C^*_{\min} \le C(a+b) \le C_{\max}$, $a + b$ may either overflow or underflow.
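The threshold test of Theorem 13 amounts to two comparisons against precomputed constants. The following sketch assumes the example moduli {7, 9, 11} and weights {-1, -1, 3}, finds the constants by brute force, and evaluates the core from the definition; `classify_sum` is a hypothetical name.

```python
MODULI = (7, 9, 11)
WEIGHTS = (-1, -1, 3)
M = 7 * 9 * 11


def core(n):
    """C(n) = sum_i w_i * floor(n / m_i)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


_cores = [core(n) for n in range(M)]
C_MIN, C_MAX = min(_cores), max(_cores)  # extrema of C on [0, M)
C_STAR_MIN = C_MIN + core(M)             # minimum of C on [M, 2M)
C_STAR_MAX = C_MAX + core(M)             # maximum of C on [M, 2M)


def classify_sum(core_of_sum):
    """Theorem 13 applied to the core of a + b, with a, b in [0, M)."""
    if core_of_sum > C_MAX:
        return "overflow"
    if core_of_sum < C_STAR_MIN:
        return "no overflow"
    return "ambiguous"


print(C_MIN, C_MAX, C_STAR_MIN, C_STAR_MAX)
print(classify_sum(core(100 + 125)))  # sum 225 stays inside [0, M)
print(classify_sum(core(500 + 500)))  # sum 1000 leaves [0, M)
```

The ambiguous band is the overlap of the two core ranges, $[C^*_{\min}, C_{\max}]$, so a more nearly linear core narrows it.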


The ambiguous case is analogous to the ambiguity present in the methods of Akushskii et al. when a critical core arises. However, far fewer ambiguous cases arise in the application of Theorem 13 than in the application of overflow detection using Theorem 6 or 7 above. For Theorems 6 and 7, $C(a)$, $C(b)$, and $C(|a + b|_M)$ must be evaluated. Thus the following cases are ambiguous, which can be resolved by the method of descent and lifting.

(i) $C(a) \le C_{\max} - C(M)$ or $C(a) \ge C(M) + C_{\min}$ (i.e., the core of $a$ is critical)

(ii) $C(b) \le C_{\max} - C(M)$ or $C(b) \ge C(M) + C_{\min}$ (i.e., the core of $b$ is critical)

(iii) $C(a + b) \le C_{\max} - C(M)$

(iv) $C(M) + C_{\min} \le C(a + b) \le C_{\max}$

(v) $C(a + b) \ge 2 \cdot C(M) + C_{\min}$.




(Conditions (iii) - (v) are equivalent to the statement that $C(|a + b|_M)$ is critical.) Of these possible five ambiguous cases, only (iv) remains an ambiguous case for Theorem 13. For this case, the method of descent and lifting can be applied as well.

Note next that Theorem 13 requires significantly fewer calculations than either Theorem 6 or Theorem 7, since only $C(a+b)$ must be computed and compared with two precalculated constants. Moreover, no restrictions on the parity of the $m_i$ or $C(M)$ are required.

Finally, to apply Theorem 13 using the redundant modulus core calculation, $m_{n+1}$ must be chosen satisfying $m_{n+1} > C^*_{\max} - C_{\min}$ so that all cores of integers in the interval $[0, 2M)$ can be computed.

We  next  develop  algorithms  for  sign  detection  and  magnitude  comparison  using  the  redundant  core 
calculation.  These  algorithms  also  involve  some  ambiguous  cases,  but  these  cases  are  shown  to 
coincide  with  those  cases  for  which  the  problem  is  ill-conditioned.  If  the  RNS  represents  signed 
integers,  calculation  of  the  sign  of  a  is  ill-conditioned  if  a  is  near  M/2,  and  comparison  of  two 
numbers  a  and  b  is  ill-conditioned  if  either  a  or  b  is  near  M/2.  The  dynamic  range  of  an  RNS  should  be 
sufficiently  large  so  that  the  computational  outputs  can  be  unambiguously  interpreted,  and  a  buffer 
zone  around  M/2  should  be  maintained.  Calculations  should  avoid  this  interval,  since  small 
perturbations  of  numbers  in  this  interval  can  cause  large  changes  in  the  values  they  represent. 

Theorem 14. For $a \in [0, M)$, if $C(2a) < C^*_{\min}$ then $a \in [0, M/2)$, and if $C(2a) > C_{\max}$, then $a \in (M/2, M)$. If $C^*_{\min} \le C(2a) \le C_{\max}$, no conclusion can be drawn.

Proof. If $C(2a) < C^*_{\min}$ then $2a \in [0, M)$. If $C(2a) > C_{\max}$, then $2a \in [M, 2M)$.

The  buffer  zone  around  M/2  which  should  be  avoided  can  now  be  given  explicitly  as 


For example, for the core function shown in Figure 1(a) ($M = 693$), this buffer zone is the interval [264, 434]. As before, use of the redundant core calculation for Theorem 14 requires $m_{n+1} > C^*_{\max} - C_{\min}$.
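The buffer zone quoted for $M = 693$ can be recovered by brute force. The sketch below assumes the example moduli and weights and takes the thresholds $C^*_{\min} = 11$ and $C_{\max} = 14$ that follow from $C_{\min} = -2$, $C_{\max} = 14$ and $C(M) = 13$; `sign_from_core` is a hypothetical name, and the core is evaluated from its definition rather than from residues.

```python
MODULI = (7, 9, 11)
WEIGHTS = (-1, -1, 3)
M = 7 * 9 * 11
C_MAX = 14        # maximum core on [0, M) for this example
C_STAR_MIN = 11   # C_min + C(M) = -2 + 13, minimum core on [M, 2M)


def core(n):
    """C(n) = sum_i w_i * floor(n / m_i)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


def sign_from_core(core_of_2a):
    """Theorem 14: +1 for a in [0, M/2), -1 for a in (M/2, M), None if ambiguous."""
    if core_of_2a < C_STAR_MIN:
        return +1
    if core_of_2a > C_MAX:
        return -1
    return None


# Integers whose sign cannot be decided from C(2a) alone.
ambiguous = [a for a in range(M) if sign_from_core(core(2 * a)) is None]
print(min(ambiguous), max(ambiguous))
print(sign_from_core(core(2 * 100)), sign_from_core(core(2 * 500)))
```

Not every integer between the two endpoints need be ambiguous; the interval is simply the smallest one containing all the undecided cases.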

To perform comparison of two numbers, the following lemma is required.

Lemma. For any integers $a$ and $q$, $C(qM + a) \equiv C(a) \pmod q$.

Proof. By Theorem 2, $C(qM + a) = C(qM) + C(a) = q\,C(M) + C(a) \equiv C(a) \pmod q$.

If $a = (a_1, \ldots, a_{n+1})$, $b = (\beta_1, \ldots, \beta_{n+1}) \in [0, M)$ and $b - a < 0$, then in the redundant RNS $(\beta_i - a_i) = M m_{n+1} + (b - a)$. Hence by the lemma, calculation of the core of $(\beta_i - a_i)$ modulo $m_{n+1}$ will uniquely determine $C(b - a)$.

Algorithm for Comparison. Let $a, b \in [0, M)$, with $[0, M/2)$ representing positive numbers and $(M/2, M)$ representing negative numbers, as usual. Compute the signs of $a$ and $b$ using Theorem 14. If the signs are different, the comparison is complete. Otherwise, compute the sign of $b - a$ to complete the comparison.

Notice that $b - a$ will be near $M/2$ only if

(i) one of $a$ or $b$ is near $M/2$ and the other is near 0 or $M$, or

(ii) neither is near $M/2$ and one is positive and the other negative.

Comparison of $a$ and $b$ fails only under condition (i), an ill-conditioned case.


Example 5. Compute the core of $a = (a_1, a_2, a_3, a_4) = (1, 8, 8, 8)$ using the same non-redundant moduli and core function as for the examples of Section III, and using a redundant modulus of $m_4 = 32$.

Using formula (7), the core of $a \in [0, M)$ is given by

$$|C(a)|_{32} = |23 a_1 + 25 a_2 + 23 a_3 + 25 a_4|_{32} .$$

Hence $|C(a)|_{32} = |607|_{32} = 31$, which uniquely determines $C(a) = |C(a)|_{32} - 32 = -1$. Thus a critical core is avoided and the descent and lift algorithm is unnecessary.
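The fixed weights 23, 25, 23, 25 of Example 5 follow from solving the Theorem 1 identity modulo the redundant modulus. The following Python sketch is an illustration under the example's assumptions (moduli {7, 9, 11}, weights {-1, -1, 3}, $m_4 = 32$, $C_{\min} = -2$); the function names are hypothetical.

```python
from math import gcd

MODULI = (7, 9, 11)
WEIGHTS = (-1, -1, 3)
M = 7 * 9 * 11
C_M = sum(w * (M // m) for w, m in zip(WEIGHTS, MODULI))  # C(M) = 13


def core(n):
    """Reference core from the definition, used only for checking."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


def redundant_weights(m_red):
    """Coefficients of the inner product modulo m_red.

    From M*C(a) = sum_i a_i*(-w_i)*(M/m_i) + a*C(M), divided by M modulo m_red.
    """
    assert all(gcd(m_red, m) == 1 for m in MODULI)
    inv_m = pow(M, -1, m_red)  # |1/M| modulo m_red
    ws = [(-w * (M // m) * inv_m) % m_red for w, m in zip(WEIGHTS, MODULI)]
    ws.append((C_M * inv_m) % m_red)  # multiplies the redundant residue
    return ws


def core_from_residues(ext_residues, m_red, c_min):
    """Core of a in [0, M) from its residues, including the redundant one."""
    ws = redundant_weights(m_red)
    r = sum(k * a for k, a in zip(ws, ext_residues)) % m_red
    # Residues at or above m_red + C_min stand for negative cores.
    return r - m_red if r >= m_red + c_min else r


print(redundant_weights(32))
print(core_from_residues((1, 8, 8, 8), 32, -2))
```

Because the path length no longer depends on intermediate results, the whole evaluation is a single inner product followed by one comparison.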


Functional  hardware  designs  for  performing  scaling  and  thresholding  using  the  redundant  modulus 
core  calculator  are  next  discussed.  These  are  included  to  give  an  appreciation  of  the  hardware 
simplicity  and  short  latency  times  for  performing  the  "difficult"  RNS  operations  using  cores.  Similar 
designs  for  other  operations  can  be  readily  conceived. 


These designs make use of several custom VLSI circuits designed and currently being fabricated by the authors, to be reported elsewhere†. These circuits permit calculations of the form $\sum c_i a_i \pmod m$ for fixed weights $c_i$ and as many as six inputs $a_i$. The calculation is performed in a single 50 nsec clock cycle. Thus these circuits permit a one-cycle calculation of the core and a one-cycle calculation of the other inner products required for the theorems above.

† Design services and silicon compiler provided by Seattle Silicon Technology Incorporated


Shown in Figure 2 is a hardware architecture for scaling by a fixed integer $q$ with $(q, m_i) = 1$. Scaling is performed by extension of base to the modulus $q$, subtraction (yielding an integer evenly divisible by $q$), and multiplication by the multiplicative inverse of $q$. The residue components $(a_i)$ are simultaneously input to a core calculator and an inner product calculator which computes the modulo $q$ inner product required in Theorem 10. The outputs of these two devices are then input to a programmable read-only memory (PROM) which completes the extension of base. This result is then fed to a PROM which computes a subtraction followed by a multiplication with the multiplicative inverse of $q$. Thus $n + 4$ devices are required, and the latency of the calculation is three clock cycles. The procedure can be pipelined to produce a new scaled result each clock period.

Figure 3 illustrates an architecture for comparison of two integers $a = (a_i)$, $b = (\beta_i) \in [0,M)$. In the first clock period, the cores $C(2a)$ and $C(2b)$ and the residue representation of $2(a-b)$ are calculated. In the second clock cycle, the algebraic signs sgn($a$) and sgn($b$) are found as in Theorem 14 using a programmable array logic (PAL) device, and the core $C(2(a-b))$ is formed. In the final clock cycle the logic of the algorithm for comparison is implemented in a PAL.


Figure 2. Scaling an integer $a = (a_1, \ldots, a_{n+1}) \in [0, M)$ by a fixed scale factor $q$ requires $n + 4$ integrated circuits, has a latency of 3 clock periods, and can be pipelined with throughputs of 20 MHz.

Figure 3. Comparison of integers $a = (a_1, \ldots, a_{n+1})$, $b = (\beta_1, \ldots, \beta_{n+1}) \in [0,M)$ requires $n + 5$ integrated circuits, and has a 3 clock period latency and 20 MHz throughput.


V.  SELECTING  A  CORE  FUNCTION 


Practical  implementation  of  residue  class  core  functions  is  contingent  upon  creating  a  methodology 
that  finds  an  appropriate  function.  For  a  given  moduli  set,  such  a  method  should  lead  to  a  core 
function  which  solves  problems  such  as  parity  and  sign  determination,  allows  scaling  and  basis 
extension,  but  which  has  small  enough  range  to  make  its  evaluation  practical. 


Akushskii  et  al.  discuss  some  partial  solutions  to  the  selection  of  the  core.  Nevertheless,  the  general 
problem  is  not  undertaken.  In  this  section  a  general  methodology  is  presented  which  allows  one  to 
find  the  core  which  is  "optimal"  for  the  application  at  hand.  First  a  decomposition  of  the  core 
function  is  introduced  and  used  to  find  an  upper  bound  of  the  core  range.  A  summary  is  then  given 
of  all  conditions  the  core  function  should  satisfy  to  maximize  its  utility.  Finally  the  selection  of  core 
functions  is  formulated  as  an  integer  optimization  problem. 


The analysis of the variability of a core function is possible by first decomposing it into a linear part and a periodic part of period $M$. Let

$$C(a) = L(a) + P(a)$$

where

$$L(a) = \sum_i w_i (a / m_i) = a\,C(M)/M ,$$

$$P(a) = -\sum_i (w_i a_i / m_i) .$$

If the weights of the core function are chosen so that $C(M)$ is positive, the linear part is increasing and its minimum and maximum are attained at 0 and $M-1$ respectively. The extrema of the periodic part follow immediately by separating the weights $w_i$ into groups according to their sign.


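$$P_{\min} = \sum_{i \in I^+} (-w_i)\, \frac{m_i - 1}{m_i} , \qquad P_{\max} = \sum_{i \in I^-} (-w_i)\, \frac{m_i - 1}{m_i} ,$$

where

$$I^+ = \{1 \le i \le n : w_i > 0\}, \qquad I^- = \{1 \le i \le n : w_i < 0\}.$$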

Bounds  for  the  extrema  and  range  of  the  core  function  follow  immediately  from  the  decomposition. 
These  results  shall  be  used  to  define  the  optimization  criterion  to  find  an  applicable  set  of  weights. 


Theorem 15. The extrema and range of the core function satisfy the following inequalities

$$C_{\min} \ge P_{\min} ,$$

$$C_{\max} \le P_{\max} + \frac{M-1}{M}\, C(M) ,$$

$$C_{\max} - C_{\min} \le P_{\max} - P_{\min} + \frac{M-1}{M}\, C(M) .$$

Lemma: Let $0 \le a < b < M$. If $C(b) - C(a) > C(M)$, then

$$b - a \ge \frac{M}{C(M)} \left( C(M) - P_{\max} + P_{\min} \right).$$


Proof. For any $a \in [0, M]$,

$$\frac{a}{M}\, C(M) + P_{\min} \le C(a) \le \frac{a}{M}\, C(M) + P_{\max} ,$$

or equivalently

$$(C(a) - P_{\max})\, \frac{M}{C(M)} \le a \le (C(a) - P_{\min})\, \frac{M}{C(M)} .$$

The lemma then follows by using the upper bound for $a$ and the lower bound for $b$.


Theorem 16. The core function $C$ is $m$-separable if

$$\frac{C(M) - P_{\max} + P_{\min}}{C(M)}$$

The quotient shown in the left hand side of Theorem 16 may be considered as an index which provides a measurement of linearity of the core function. This quotient varies between $-\infty$ and 1. It takes values close to 1 when $P_{\max} - P_{\min}$ is close to 0, i.e., $C(\cdot)$ is close to its linear part $L(\cdot)$.


The last two theorems and the results in Section II are now used to define the criterion and conditions for the selection of a core function. In Section IV the redundant modulus is chosen larger than the range of the core function. It follows that a practical core must have a small range, so that an optimal core function is defined as one having minimal range over all core functions for a fixed moduli set. The last inequality in Theorem 15 is used to define the functional to be minimized. The authors' experience indicates that the bound for the range is tight when close to the optimal solution and that the process actually does lead to a core with minimal range. The conditions the core function should satisfy depend on the particular application at hand. The ones presented

Step (i) Verify $b \ne 0$.

Step (ii) Compare $r_i$ and $b$. If $r_i < b$, $[a/b] = q_i$ and the algorithm is complete.

Step (iii) Write

$$\frac{r_i}{b} = \frac{S_L(r_i) \cdot r_i}{S_L(r_i) \cdot b} = S_{L_1}\big(S_L(r_i) \cdot b\big) \cdot \frac{S_L(r_i) \cdot r_i}{S_{L_1}\big(S_L(r_i) \cdot b\big) \cdot S_L(r_i) \cdot b} .$$

Define $q_{i+1} = q_i + S_{L_1}(S_L(r_i) \cdot b)$ and $r_{i+1} = a - q_{i+1} b$. Return to step (i).


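To prove convergence of the algorithm, we need only show that $1 \le S_{L_1}(S_L(r_i) \cdot b) \le r_i / b$. By definition, $S_{L_1} \ge 1$ always. If $S_L(r_i) \cdot b > L_1$, then $S_{L_1}(S_L(r_i) \cdot b) = 1$. Otherwise

$$S_{L_1}\big(S_L(r_i) \cdot b\big) \cdot S_L(r_i) \cdot b \le L_1 < S_L(r_i) \cdot r_i .$$

Hence the quotient

$$\frac{S_L(r_i) \cdot r_i}{S_{L_1}\big(S_L(r_i) \cdot b\big) \cdot S_L(r_i) \cdot b} \ge 1 ,$$

establishing convergence.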


A modification of the above algorithm is next presented which has faster convergence for relatively linear core functions. However, this algorithm is more difficult to analyze, and a general proof of convergence has not been obtained in terms of the parameters involved. In the first division algorithm, the sequence of quotient estimates $q_i$ converged monotonically to $[a/b]$; in the next

As in Section VI, for any integer $a$, $0 < a \le L$, $s_L(C(a)) \cdot a \le L$. We next define a function $S_L : [0, L] \to N$ iteratively in terms of $s_L$. Let $a_0 = a \in [0, L]$ be given. Let $c_i = C(a_i)$, $s_i = s_L(c_i)$, and define $a_{i+1} = s_i a_i$. This procedure terminates when $s_{k+1} = 1$, $s_k > 1$, and we take $S_L(a) = s_0 s_1 \cdots s_k$, again giving $S_L(a) \cdot a \le L$. We claim also that $S_L(a) \cdot a > L/2 - \Delta P \cdot M / C(M)$. To prove this, assume $s_L(c) = 1$. Then

Thus if $x$ has $s_L(C(x)) = 1$,

$$x > (C(x) - P_{\max})\, \frac{M}{C(M)} \ge \frac{L}{2} - \frac{\Delta P \cdot M}{C(M)} .$$

Since by the definition of $S_L$, $S_L(S_L(a) \cdot a) = 1$, the claim is established.

Assume the redundant modulus is chosen such that all cores of integers in $[0, L]$ can be unambiguously computed by the redundant modulus method. Define $L_1 = L/2 - \Delta P \cdot M / C(M)$.

Algorithm for Division I. Let $a, b \in [0, M)$. The quotient $[a/b]$ is computed iteratively by the following steps. For initial conditions, take $q_0 = 0$, $r_0 = a$.


I. 


ALGORITHMS  FOR  DIVISION 


In any number system, division is the most computationally complex of the fundamental arithmetic operations. In the RNS, division algorithms have been so complex that the system has been judged unsuitable for computational processes requiring division. RNS division algorithms have been presented by Szabo and Tanaka [4], and more recently, division algorithms using the core function have been presented by Akushskii et al. [7]. However, even the core-based division algorithms are extremely cumbersome, and cannot be utilized in a high-speed real time signal processing system.

In this section, a new core-based algorithm for general division is presented. By previous standards, the algorithm is very efficient, and is competitive with the usual binary system algorithms for division. Two algorithms are presented: the first is guaranteed to converge for any RNS and core function, while the second has superior convergence properties for well behaved core functions. Both algorithms are iterative, and thus are perhaps not suited to flow-through architectures for high speed signal processing. However, the algorithms might well be suited to a microcoded general purpose RNS arithmetic unit.

If a core function is approximately linear, then for $b$ bounded away from zero, $a/b \approx C(a)/C(b)$. Moreover, since $P(a) = C(a) - (C(M)/M)\,a$ does not increase with $a$,

$$\lim_{k \to \infty} \frac{C(ka)}{C(kb)} = \frac{a}{b} .$$

We will temporarily utilize the expanded dynamic range provided by the redundant modulus to improve the estimate of $a/b$. At each stage of the algorithms, an improved estimate $q$ of the quotient is found. The process terminates when $0 \le a - qb < b$. As in the algorithms for comparison, core-derived upward scaling of an integer is required. We define a scaling function similar to those defined in Section VI.

Let an integer $L > 0$ be given and let $C_L = \max \{ C(x) : 0 \le x \le L \}$. Define a non-increasing function $s_L : [C_{\min}, C_L] \to N - \{0\}$ by


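$$t(C(a)) \ge \left( \frac{a\,C(M)}{M} + P_{\min} - P_{\max} \right) \frac{M}{C(M)} = a + (P_{\min} - P_{\max})\, \frac{M}{C(M)} ,$$

$$a' = a - t(C(a)) \le (P_{\max} - P_{\min})\, \frac{M}{C(M)} ,$$

$$b' = b - t(C(a)) \le 2\,(P_{\max} - P_{\min})\, \frac{M}{C(M)} ,$$

since the condition on the cores implies that $|b - a| \le (P_{\max} - P_{\min})\, M / C(M)$. Thus a sufficient condition for $S(a', b') \ge 2$ is

$$4\,(\Delta P)\, \frac{M}{C(M)} \le L .$$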


Example 7. Let the moduli set be {7, 9, 11} with core coefficients {-1, -1, 3}. Then $\Delta P = (6/7 + 8/9 + 30/11) \approx 4.47$. Since $5(\Delta P) > C(M) = 13$, the first algorithm cannot be used. For the second algorithm we must have

$$L \ge \frac{4(4.47) \cdot 693}{13} = 953 ,$$

so the redundant modulus $m_4 = 32$ will suffice. Since calculated cores of 30 and 31 represent true cores -2 and -1, respectively, $L$ will be the largest number whose core does not exceed 29; viz, $L = 1517$.
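The constants of Example 7 can be recomputed directly. The sketch below assumes $\Delta P = P_{\max} - P_{\min} = \sum_i |w_i| (m_i - 1)/m_i$ and reads "the largest number whose core does not exceed 29" as the largest $L$ such that every integer in $[0, L]$ has core at most 29; `largest_l` is a hypothetical helper.

```python
from fractions import Fraction

MODULI = (7, 9, 11)
WEIGHTS = (-1, -1, 3)
M = 7 * 9 * 11
C_M = sum(w * (M // m) for w, m in zip(WEIGHTS, MODULI))  # C(M) = 13


def core(n):
    """C(n) = sum_i w_i * floor(n / m_i)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


# Delta P = P_max - P_min, kept exact as a fraction.
delta_p = sum(Fraction(abs(w) * (m - 1), m) for w, m in zip(WEIGHTS, MODULI))

# Sufficient condition for the second algorithm: L >= 4 * Delta P * M / C(M).
l_bound = 4 * delta_p * M / C_M


def largest_l(max_core):
    """Largest L such that every x in [0, L] has C(x) <= max_core."""
    x = 0
    while core(x) <= max_core:
        x += 1
    return x - 1


print(float(delta_p), float(l_bound), largest_l(29))
```

With exact arithmetic the bound is slightly above the rounded figure of 953, and the available $L$ for $m_4 = 32$ exceeds it comfortably.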


We compare $a_0 = a = (3,8,10,11) = 395$ and $b = (5,1,1,13) = 397$.


Both  functions  t  and  s  can  be  implemented  by  table  lookup  since  their  domain  is  small  if  the  range 
of  the  core  function  is  small. 


We next iteratively define a function S(a,b) for a, b ∈ [0,L] as follows. Take a_0 = a, b_0 = b. Let
c_i = min{C(a_i), C(b_i)} and s_i = s(c_i), and define a_{i+1} = s_i·a_i, b_{i+1} = s_i·b_i. This
procedure is iterated until s_{k+1} = 1, s_k > 1 (usually only one or two iterations are needed).
Then S(a,b) = s_0 s_1 ··· s_k. Thus S(a,b)·a ≤ L, S(a,b)·b ≤ L, and
min{s(S(a,b)·a), s(S(a,b)·b)} = 1; i.e., based only on knowledge of the cores, S(a,b)·a and
S(a,b)·b have been upward scaled as much as possible.


Second Algorithm for Comparison. Assume C(M) > 0 and let a_0 = a = (α_i), b_0 = b = (β_i) ∈ [0,L].

(i) If α_i = β_i for i = 1,...,n, then a = b.

(ii) If C(a_i) - C(b_i) ≥ ΔP, then a > b, and if C(b_i) - C(a_i) ≥ ΔP, then b > a.

(iii) Let c_i = min(C(a_i), C(b_i)), and form a_i' = a_i - t(c_i), b_i' = b_i - t(c_i). (Then a_i' and
b_i' are nonnegative.)

(iv) Compute a_{i+1} = S(a_i', b_i')·a_i', b_{i+1} = S(a_i', b_i')·b_i', and go to (ii).


A condition on L (and therefore also on m_{n+1}) is next derived which guarantees convergence of the
algorithm. This is done by examining worst-case behavior. It can be readily seen that convergence is
ensured if each S(a_i', b_i') > 1; then |a_{i+1} - b_{i+1}| > |a_i - b_i|, and the hypothesis of
step (ii) will eventually be satisfied.


We assume, without loss of generality, that C(a) ≤ C(b) and |C(a) - C(b)| < ΔP. Then

The algorithm is iterative and requires the use of a redundant modulus. The redundant modulus must be
chosen sufficiently large to guarantee convergence.


As in Section IV, the redundant modulus is chosen to satisfy m_{n+1} > C_max - C_min. The method of
redundant modulus core calculation can then be used to unambiguously determine cores on an
interval [0,L] ⊃ [0,M), where L depends upon m_{n+1}. Let C_L be the maximum core on [0,L]. Given
integers a_i, b_i ∈ [0,L], the comparison algorithm will find integers a_{i+1}, b_{i+1} ∈ [0,L]
satisfying (i) a_i ≤ b_i if and only if a_{i+1} ≤ b_{i+1}; and (ii) |a_{i+1} - b_{i+1}| > |a_i - b_i|.
This is iterated until the first step k for which |C(a_k) - C(b_k)| ≥ ΔP. By the lemma above, the
comparison can then be determined.


To obtain a_{i+1} and b_{i+1} from a_i and b_i, a subtraction and a scaling are performed. From the
cores of a and b, a number is computed and subtracted from both a and b, giving nonnegative results.
From the cores of these differences, a scale factor greater than one is computed and multiplied by
the differences. These results will be guaranteed to lie in [0,L] and are taken as a_{i+1} and b_{i+1}.
Hence we must define functions on cores to give the subtrahend and the scale factor. As mentioned
above, it is required that C(M) > 0.


Let Z denote the integers and N the natural numbers. Define a non-decreasing function
t: [C_min, C_L] → N by t(c) = max{⌊(c - P_max) M/C(M)⌋, 0}. Then for any integer a ≥ 0,

\[ 0 \le t(C(a)) \le \left\lfloor \left( \frac{C(M)}{M}\,a + P_{\max} - P_{\max} \right) \frac{M}{C(M)} \right\rfloor = a. \]


Define a non-increasing function s: [C_min, C_L] → N - {0} by

\[ s(c) = \max\left\{ \left\lfloor \frac{L\,C(M)}{(c - P_{\min})\,M} \right\rfloor,\ 1 \right\}. \]

Then for any integer a, 0 ≤ a ≤ L,

\[ s(C(a)) \cdot a \le L \quad \text{since} \quad a \le \frac{(C(a) - P_{\min})\,M}{C(M)}. \]
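The two core-indexed functions can be sketched directly from their defining properties. The code below is an illustration, not the authors' implementation: it assumes C(n) = Σ w_i ⌊n/m_i⌋, takes t(c) = max{⌊(c - P_max) M/C(M)⌋, 0} and s(c) = max{⌊L·C(M)/((c - P_min) M)⌋, 1}, and borrows the moduli, weights and the value L = 50,722,113 from Example 8. Exact rational arithmetic is used so that the floors are not disturbed by rounding.

```python
from fractions import Fraction
from math import floor, prod

MODULI = (23, 25, 27, 29, 31)
WEIGHTS = (-3, -2, 4, -1, 3)
M = prod(MODULI)
L = 50_722_113          # value of L quoted in Example 8 (assumed here)


def core(n):
    """Core function C(n) = sum_i w_i * floor(n / m_i) (assumed definition)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


C_M = core(M)
# P_max collects the negative weights, P_min the positive ones
P_MAX = sum(Fraction(-w * (m - 1), m) for w, m in zip(WEIGHTS, MODULI) if w < 0)
P_MIN = -sum(Fraction(w * (m - 1), m) for w, m in zip(WEIGHTS, MODULI) if w > 0)


def t(c):
    """Subtrahend derived from a core value; never exceeds the number itself."""
    return max(floor((c - P_MAX) * M / C_M), 0)


def s(c):
    """Scale factor derived from a core value; keeps the product within [0, L]."""
    return max(floor(Fraction(L * C_M) / ((c - P_MIN) * M)), 1)


for a in (0, 1, 1000, 161_216, 1_000_000, M - 1, L):
    c = core(a)
    print(a, c, t(c), s(c))
```

Because both functions depend only on the core, their domains are small and each could be tabulated, which is the table-lookup implementation the text refers to.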


If the cores are calculated by the redundant modulus method, then by the lemma preceding the
algorithm of signed comparison in Section IV,
C(α_1 - β_1, ..., α_n - β_n, α_{n+1} - β_{n+1}) ≡ C(d) (mod m_{n+1}).
Thus redundant modulus core calculation can be used for the algorithm.

To select a core function which allows this comparison algorithm, the linear condition 5ΔP < C(M)
should be included in the list of linear conditions for the integer optimization problem formulated in
Section V.

Example 6. Let the moduli set be {23, 25, 27, 29, 31} and the core weights be {-3, -2, 4, -1, 3} as in
Figure 1(c). Then

\[ \Delta P = \sum_i |w_i| \, \frac{m_i - 1}{m_i} \approx 12.51 \]

and C(M) = 67. Since 5ΔP < 67, the algorithm applies.

(i) Compare a = (6,0,1,22,2) and b = (7,0,5,23,10). Then C(a) = 6, C(b) = 24, C(b) - C(a) =
18 ≥ ΔP, so b > a. The values are a = 1,000,000, b = 5,000,000.

(ii) Compare a = (6,0,1,22,2) and b = (12,0,2,15,4).

Then C(a) = 6, C(b) = 11, so |C(a) - C(b)| < ΔP. Form d = (6,0,1,22,2) - (12,0,2,15,4) =
(17,0,26,7,29). C(d) = 58 > 2ΔP + P_max, so b > a. The values here are a = 1,000,000,
b = 2,000,000.
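Both cases of Example 6 can be reproduced with a small sketch of the first comparison algorithm. For readability it works on ordinary integers rather than residue tuples, and it assumes C(n) = Σ w_i ⌊n/m_i⌋. The explicit test for d = 0 is an addition made here so that equal operands are reported as equal; the function name `compare` and its return convention are illustrative.

```python
from fractions import Fraction
from math import prod

MODULI = (23, 25, 27, 29, 31)
WEIGHTS = (-3, -2, 4, -1, 3)
M = prod(MODULI)


def core(n):
    """Core function C(n) = sum_i w_i * floor(n / m_i) (assumed definition)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


P_MAX = sum(Fraction(-w * (m - 1), m) for w, m in zip(WEIGHTS, MODULI) if w < 0)
P_MIN = -sum(Fraction(w * (m - 1), m) for w, m in zip(WEIGHTS, MODULI) if w > 0)
DELTA_P = P_MAX - P_MIN


def compare(a, b):
    """Return (sign, step): sign is 1 if a > b, -1 if a < b, 0 if a == b."""
    if not 5 * DELTA_P < core(M):
        raise ValueError("linearity condition 5*DeltaP < C(M) is not met")
    ca, cb = core(a), core(b)
    if ca - cb >= DELTA_P:            # step (i)
        return 1, "i"
    if cb - ca >= DELTA_P:
        return -1, "i"
    d = (a - b) % M                   # step (ii): residue-wise difference
    if d == 0:                        # added here: all residues equal
        return 0, "ii"
    return (1 if core(d) <= 2 * DELTA_P + P_MAX else -1), "ii"


print(compare(1_000_000, 5_000_000))  # decided in step (i)
print(compare(1_000_000, 2_000_000))  # decided in step (ii), with C(d) = 58
```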

Note that the condition 5ΔP < C(M) is a relatively strong linearity condition: while the core of
Figure 1(c) satisfies this condition, the core of Figure 1(a) does not. The next algorithm for
comparison relaxes this linearity condition at the cost of increased computational requirements.

and hence a - b ≥ 0. Since C(a) ≠ C(b), a ≠ b, obtaining the strict inequality. To show (ii), first note
that for any integer x,


Thus

First Algorithm for Comparison. Assume 5ΔP < C(M). Then integers a = (α_i), b = (β_i) ∈ [0,M) can
be compared via the following steps.

(i) Compute C(a) and C(b). If C(a) - C(b) ≥ ΔP, then a > b, or if C(b) - C(a) ≥ ΔP, then
b > a. If either condition holds, the algorithm is complete.

(ii) Otherwise, |C(a) - C(b)| < ΔP. Form the difference d = (α_i - β_i) ∈ [0,M) and compute
C(d). If C(d) ≤ 2(P_max - P_min) + P_max then b ≤ a; otherwise b > a.


The validity of step (i) of the algorithm follows immediately from the lemma. For step (ii), the
lemma implies |a - b| < 2MΔP/C(M). Thus d lies in one of the intervals I_1 = [0, 2MΔP/C(M)) or
I_2 = [M - 2MΔP/C(M), M). Since max{C(x) | x ∈ I_1} ≤ 2ΔP + P_max and
min{C(x) | x ∈ I_2} ≥ C(M) - 2ΔP + P_min, it follows from the algorithm hypothesis that
C(I_1) ∩ C(I_2) = ∅. Thus if C(d) ∈ C(I_1), d ∈ I_1, and if C(d) ∈ C(I_2), d ∈ I_2, establishing
step (ii).
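The role of the hypothesis 5ΔP < C(M) in this argument can be made explicit. The short derivation below is a sketch that assumes ΔP = P_max - P_min, as used in step (ii) of the algorithm; it shows that the two sets of core values are separated exactly when the hypothesis holds.

```latex
\begin{align*}
\max_{x \in I_1} C(x) &\le 2\,\Delta P + P_{\max}, \\
\min_{x \in I_2} C(x) &\ge C(M) - 2\,\Delta P + P_{\min}, \\
C(I_1) \cap C(I_2) = \emptyset
  &\;\Longleftarrow\; 2\,\Delta P + P_{\max} < C(M) - 2\,\Delta P + P_{\min} \\
  &\iff 4\,\Delta P + (P_{\max} - P_{\min}) < C(M) \\
  &\iff 5\,\Delta P < C(M).
\end{align*}
```

For the core of Example 6 the two thresholds are approximately 2ΔP + P_max ≈ 30.8 and C(M) - 2ΔP + P_min ≈ 35.2, so the value C(d) = 58 found there lies on the I_2 side.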


VI. ALGORITHMS FOR COMPARISON


In Section IV, an algorithm utilizing the redundant modulus core calculation was given for the
comparison of two numbers in a signed RNS. The algorithm requires the capability to determine
whether a number lies in the upper or lower half of [0,M). This is accomplished by maintaining a
buffer zone around M/2 so the determination can be made by an examination of the core. In this
section, two new algorithms are presented for comparison in an unsigned RNS, and no restrictions
on the numbers are assumed. The first algorithm is computationally simple: to compare a and b,
only C(a), C(b), and C(a-b) are required. However, a linearity condition on the core is required, viz.,
5(P_max - P_min) < C(M). The second algorithm is iterative, and is guaranteed to terminate if C(M) > 0
and if the redundant modulus is chosen sufficiently large. Convergence can be hastened by
increasing the degree of separability of the core or by increasing the redundant modulus.


We begin the development of the first algorithm with the following lemma. Recall that P_min and
P_max were defined in Section V, and that

Lemma. Let a and b be arbitrary integers, and let C(M) > 0.

(i) If C(a) - C(b) ≥ ΔP, then a > b.

(ii) If |C(a) - C(b)| < ΔP, then |a - b| < 2MΔP/C(M).


Proof. For (i), observe that

\[ P_{\max} - P_{\min} \le C(a) - C(b) \]

Proof. (i) and (ii) follow immediately from the definition of S_n and from C(0) = 0.

(iii) Follows since P_min + a·C(M)/M ≤ -n for all a ∈ S_n.

(iv) Let n > K = ⌊-P_min⌋; then n ≥ K+1 > -P_min, which in turn implies that S_n is empty.

Algorithm for search of C_min.

Step 1. Start algorithm with initial values a_1 = c_1 = 0, n_1 = 1, and J_1 = S_1.

Step 2. Given a_i with C(a_i) = c_i = 1 - n_i, evaluate the core of successive elements of the set
J_i = {a_i + 1, a_i + 2, ..., M} ∩ S_{n_i}, until either (i) an integer b is found such that C(b) = -K,
or (ii) an integer b is found such that -K < C(b) < c_i, or (iii) C(x) ≥ c_i for all x ∈ J_i. If (i) or
(iii) occur then C_min has been found: C_min = C(b) in case (i), and C_min = C(a_i) = c_i in case (iii).
In the event of case (ii), step 2 is repeated with a_{i+1} = b, c_{i+1} = C(b), and n_{i+1} = 1 - C(b).

Example. In the RNS with moduli set {7, 9, 11} consider the core function with weights {-1, -1, 3}.
Then P_min = -30/11, K = 2 and Ω is the set of multiples of 7 and 9. From Theorem 17, max S_1 ≤ 92
and max S_2 ≤ 38. The first iteration of step 2 evaluates the core of the elements in the set
J_1 = {7, 9, 14, 18, 21, ..., 91}. Since C(7) = -1, a second iteration of step 2 is started with a_2 = 7,
c_2 = -1, n_2 = 2, and J_2 = {9, 14, 18, 21, 27, 28, 35, 36}. Since C(9) = -2 = -K, it follows that
C_min = C(9) = -2.
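The search can be sketched in a few lines. The version below is a simplified reading of the algorithm, not the authors' implementation: it scans the multiples of the negatively weighted moduli in increasing order, tightens the upper bound of Theorem 17 (iii) each time a smaller core is found, and stops when the bound is passed or the value -K is reached. It assumes C(n) = Σ w_i ⌊n/m_i⌋ and C(M) > 0; the function names are hypothetical.

```python
from fractions import Fraction
from math import floor, prod


def make_core(moduli, weights):
    """Return the core function C(n) = sum_i w_i * floor(n / m_i)."""
    def core(n):
        return sum(w * (n // m) for w, m in zip(weights, moduli))
    return core


def search_core_min(moduli, weights):
    """Chained search for C_min (sketch; assumes C(M) > 0)."""
    core = make_core(moduli, weights)
    M = prod(moduli)
    c_M = core(M)
    p_min = -sum(Fraction(w * (m - 1), m)
                 for w, m in zip(weights, moduli) if w > 0)
    K = floor(-p_min)
    negative = [m for w, m in zip(weights, moduli) if w < 0]
    if not negative:
        return 0                      # the core never jumps downward

    best, n, x = 0, 1, 0              # C(0) = 0 is the starting value
    while n <= K:
        # Theorem 17 (iii): elements with core <= -n cannot exceed this bound
        bound = min(M - 1, floor(-M * (n + p_min) / c_M))
        # next multiple of a negatively weighted modulus beyond x
        x = min((x // m + 1) * m for m in negative)
        if x > bound:
            break                     # case (iii): no smaller core remains
        c = core(x)
        if c <= -n:                   # cases (i)/(ii): smaller core found
            best, n = c, 1 - c
    return best


print(search_core_min((7, 9, 11), (-1, -1, 3)))
```

For the example in the text the loop visits only 7 and 9 before stopping, matching the two iterations described there.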


In the case of an RNS with small range the extrema of a core function can be found through an
exhaustive search. For moderately large systems an exhaustive search may prove prohibitive. In the
remainder of this section an algorithm is presented for the evaluation of the minimum of a core
function. The authors' experience suggests that the method has fast convergence even for an RNS
with large range. The algorithm is based on a search through a chain of dynamically chosen sets with
successively smaller ranges. Since C(0) = 0, the algorithm starts by searching for an integer in the
domain of the RNS which has negative core. Since downward jumps of the core function occur only
at multiples of the moduli m_i for which w_i is negative, i.e., i ∈ I_-, the search is restricted to this set.
The first step is then to search through increasing multiples of those moduli until one with negative
core is found. If this occurs, say C(a_1) = -n_1, then the algorithm goes into a second phase with a
search through multiples larger than a_1 and with a core value less than -n_1. If such an integer is
found, say a_2 with C(a_2) = -n_2, then the second phase is restarted replacing a_1 and n_1 with a_2 and
n_2. The sets of integers for which the core is less than or equal to -n form a decreasing chain, so the
successive phases of the algorithm take place in successively smaller domains. This assures a fast
stopping rule for the algorithm and convergence to C_min.

Let Ω = {a | α_i = 0 for some i ∈ I_-} be the set of multiples of the moduli m_i for which w_i is
negative. Let S_n = {a ∈ [0,M) ∩ Ω | C(a) ≤ -n}, for n ≥ 0, be the chain of sets through which the
search takes place. The search algorithm is based on properties of the S-chain given in the theorem
below.

Theorem 17. The chain of sets S_n satisfies the following properties:

(i) The minimum C_min is attained in S_0.

(ii) The chain is decreasing, i.e., S_n ⊃ S_{n+1}.

(iii) The maximum element of S_n is bounded above by -M(n + P_min)/C(M).

(iv) The chain terminates after K steps, where K = ⌊-P_min⌋; i.e., S_n = ∅ for all n > K.


These conditions for the optimization problem arise as follows. Condition (i) states that C(M) > 0.
Condition (ii) is equivalent to the condition that C(M) be odd, required for parity determination
(Theorem 4). Condition (iii) states that w_i and m_i are relatively prime, necessary for scaling by m_i
(Theorem 9). This condition is equivalent to the existence of a solution to the set of linear equations


where p_ij are the prime factors of m_i, and x_ij and y_ij are integers with 1 ≤ y_ij < p_ij. The
restriction (iv) follows from the separability condition, Theorem 16, where K_1 is an appropriate
constant; e.g., K_1 = 1/m_1, where m_1 is the smallest modulus, will permit lifting and descent. Larger
values for K_1 will permit greater separability. Finally, the left-hand side of inequality (v) represents
the largest jump the core function may take. This condition may be used to reflect prior restrictions
on the range; e.g., if a five-bit representation is to be used then K_2 = 32.

The optimization problem can be solved by standard techniques in integer programming such as
branch and bound and enumeration searches [5], [6]. Such methods were used for the case of the
moduli set {23, 25, 27, 29, 31}, obtaining a core function with weights {-3, -2, 4, -1, 3}. (This
function is m-separable for all m ≥ 2.)

Algorithm for evaluating core function extrema. In Theorem 15 bounds are given for C_max and C_min,
which prove sufficiently tight for closely linear core functions. In many cases, however, more accuracy
is desired and so exact values of the extrema of the core function are calculated.

From Theorem 3 it follows that if C_min is attained at a, then C_max is attained at (M - a - 1), and that
C_max + C_min = C(M) - Σ w_i. Hence it is sufficient to evaluate only one extreme value of the core
function.

include all those required for the algorithms for the difficult RNS operations. Since the range bound
given in Theorem 15 assumes that C(M) is positive, this condition is required. Parity determination,
scaling and m-separability form the remaining conditions. The separability condition, which
measures linearity of the core function, works counter to the minimization criterion. In each
particular application this trade-off must be considered, but by using different values for the constant
K_1, below, complete control over this trade-off is obtained.

The question of finding the appropriate core function can now be stated as an integer optimization
problem.
A practical core can be found by selecting integers w_1, ..., w_n which minimize the functional

subject to one or more of the following linear conditions as dictated by the intended application.

algorithm, each q_i is a better estimate of ⌊a/b⌋ than the corresponding q_i of the first, but the
sequence {q_i} may alternate about ⌊a/b⌋. Upon convergence, q = ⌊a/b⌋ or q = ⌊a/b⌋ + 1; in either case
|a/b - q| < 1.

Algorithm for Division II. Let a, b ∈ [0, M), and let q_0 = 0, r_0 = a, δ_0 = 1, and s_0 = 0.

Step (i) Verify b ≠ 0.

Step (ii) Compare r_i and b. If r_i < b, then q_i is the quotient.

Step (iii) Let

\[ s_{i+1} = S_{L/2}(S_L(r_i)\,b) \cdot
   \left[ \frac{C(S_L(r_i)\,r_i)}{C(S_{L/2}(S_L(r_i)\,b)\,S_L(r_i)\,b)} \right]_{\text{rounded}}, \]

\[ q_{i+1} = q_i + \delta_i\,s_{i+1}, \]

\[ r_{i+1} = |a - b\,q_{i+1}|, \]

\[ \delta_{i+1} = \operatorname{sgn}(a - b\,q_{i+1}). \]

It is clear from the earlier remark (viz., lim C(ka)/C(kb) = a/b) that for sufficiently large L, this
algorithm will converge. Moreover, increasing the redundant modulus will cause L to increase and
therefore increase the rate of convergence. Finally, for a given L, the algorithm will converge faster
for approximately linear cores than for highly nonlinear ones. It is the authors' experience that for
cores which are sufficiently linear for other applications (e.g., comparison), the algorithm converges
quickly.

In both division algorithms, the scaling functions S can be implemented by iterated table lookup,
and the quotient of cores for the second algorithm can be implemented by table lookup.


We next discuss the computational aspects of implementing Algorithm II in a redundant RNS. We
assume 5ΔP < C(M) so that the first algorithm for comparison can be used (cf. Section VI). The
integers a = (α_1, ..., α_{n+1}), b = (β_1, ..., β_{n+1}) ∈ [0,M) are given by their residue
representations. For step (i), it must be checked that not all components β_1, ..., β_n are zero.

For step (ii), we are given r_i = (ρ_1, ..., ρ_{n+1}). The cores C(r_i) and C(b) are computed by the
redundant modulus method. These values can be input to a table lookup, which signals one of the
following conditions: (a) C(r_i) - C(b) ≥ ΔP; (b) C(b) - C(r_i) ≥ ΔP; or (c) |C(r_i) - C(b)| < ΔP. If (a)
holds, proceed to step (iii) of the algorithm. If (b) holds, the algorithm terminates. If (c) holds,
r_i - b = (ρ_j - β_j) and its core are computed. This core can be input to a table, and the decision to
continue or terminate results.

For step (iii), we are given q_i in residue form and δ_i = ±1. The core C(r_i) is first computed and input
to a table lookup giving s_L(r_i), and the residue product s_L(r_i)·r_i is formed. Then C(s_L(r_i)·r_i) is
found and input to a table giving s_L(s_L(r_i)·r_i). If this equals 1, then S_L(r_i) = s_L(r_i); otherwise
the procedure is iterated to find S_L(r_i). Generally, only one or two iterations are needed. Next, the
residue product S_L(r_i)·b is formed, and S_{L/2}(S_L(r_i)·b) is formed as above. Note that in the
iterative process for evaluating the functions S_L and S_{L/2}, the cores C(S_L(r_i)·r_i) and
C(S_{L/2}(S_L(r_i)·b)·S_L(r_i)·b) have already been computed, being required for termination of the
iterative evaluation. Since these cores are small, their rounded quotient (in residue form) can be
found by table lookup. This quotient is then multiplied by S_{L/2}(S_L(r_i)·b) to give s_{i+1} = (σ_j).

If δ_i = 1, δ_i s_{i+1} = (σ_j), and if δ_i = -1, δ_i s_{i+1} = (m_j - σ_j), implemented as a table lookup.
The new quotient estimate can next be computed as a residue subtraction.

Next, a - b·q_{i+1} is evaluated and its core computed via the redundant modulus method. Its core is
also computed by the second corollary to Theorem 2. If these results are equal, a - b·q_{i+1} ≥ 0;
otherwise they will differ by C(M) and a - b·q_{i+1} < 0. This gives δ_{i+1}, and r_{i+1} is found by
componentwise m_j-complementing if necessary.

Example 8. Estimate a/b = 1000000/3456 using the moduli set {23, 25, 27, 29, 31}, core weights
{-3, -2, 4, -1, 3}, and redundant modulus m_6 = 256. System constants are P_min = -6.76, P_max = 5.76,
ΔP ≈ 12.51, L = 50,722,113. For clarity, the solution is given in decimal form.


Iteration 1.

    r_0 = 1,000,000 (> b)
    S_L(r_0) = 38
    S_L(r_0)·r_0 = 38,000,000 with core 183
    S_L(r_0)·b = 131,328
    S_{L/2}(S_L(r_0)·b) = 168
    S_{L/2}(S_L(r_0)·b)·S_L(r_0)·b = 22,063,104 with core 108
    quotient of cores = 2
    s_1 = q_1 = 336
    δ_1 = -1
    r_1 = 161,216 (> b)

Iteration 2.

    S_L(r_1) = 204
    S_L(r_1)·r_1 = 32,888,064 with core 159
    S_L(r_1)·b = 705,024
    S_{L/2}(S_L(r_1)·b) = 22
    S_{L/2}(S_L(r_1)·b)·S_L(r_1)·b = 15,510,528 with core 76
    quotient of cores = 2
    s_2 = 44
    q_2 = 292
    δ_2 = -1
    r_2 = 9,152 (> b)

Iteration 3.

    S_L(r_2) = 3906
    S_L(r_2)·r_2 = 35,747,712 with core 173
    S_L(r_2)·b = 13,499,136
    S_{L/2}(S_L(r_2)·b) = 1
    S_{L/2}(S_L(r_2)·b)·S_L(r_2)·b = 13,499,136 with core 69
    quotient of cores = 3
    s_3 = 3
    q_3 = 289
    δ_3 = 1
    r_3 = 1216

Since r_3 < b, a/b ≈ q_3 = 289. To 4 decimal digits, a/b = 289.3519.
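The iterations of Example 8 can be reproduced with a compact sketch of Algorithm II working, like the example, on ordinary integers rather than residues. Several points are assumptions made for illustration: the core is taken as C(n) = Σ w_i ⌊n/m_i⌋, the one-step scale factor as max{⌊limit·C(M)/((c - P_min) M)⌋, 1}, "rounded" as round-half-up, and the comparison in step (ii) as a plain integer comparison. The names `scale` and `divide` are hypothetical.

```python
from fractions import Fraction
from math import floor, prod

MODULI = (23, 25, 27, 29, 31)
WEIGHTS = (-3, -2, 4, -1, 3)
M = prod(MODULI)
L = 50_722_113                        # system constant from Example 8


def core(n):
    """Core function C(n) = sum_i w_i * floor(n / m_i) (assumed definition)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


C_M = core(M)
P_MIN = -sum(Fraction(w * (m - 1), m) for w, m in zip(WEIGHTS, MODULI) if w > 0)


def scale(x, limit):
    """Iterated core-derived upward scaling S_limit(x)."""
    total = 1
    while True:
        step = max(floor(limit * C_M / ((core(total * x) - P_MIN) * M)), 1)
        if step == 1:
            return total
        total *= step


def divide(a, b, max_iter=50):
    """Return (q, r, delta) with a = b*q + delta*r and r < b (sketch)."""
    if b == 0:
        raise ZeroDivisionError("step (i): b must be nonzero")
    q, r, delta = 0, a, 1
    for _ in range(max_iter):
        if r < b:                                   # step (ii)
            return q, r, delta
        s_r = scale(r, L)                           # S_L(r_i)
        s_b = scale(s_r * b, Fraction(L, 2))        # S_{L/2}(S_L(r_i) b)
        ratio = Fraction(core(s_r * r), core(s_b * s_r * b))
        s = s_b * floor(ratio + Fraction(1, 2))     # rounded quotient of cores
        q += delta * s
        diff = a - b * q
        r, delta = abs(diff), (1 if diff >= 0 else -1)
    raise RuntimeError("no convergence within max_iter iterations")


print(divide(1_000_000, 3456))
```

When the returned sign is -1 the estimate q overshoots the integer quotient by one, which matches the statement that q equals ⌊a/b⌋ or ⌊a/b⌋ + 1 on convergence.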

Statistics 

Three simulation runs of 200 trials each were done to test the division algorithm in the RNS with
moduli set {23, 25, 27, 29, 31}. In each trial two numbers, a and b, were generated. The first one
was generated at random from the complete range of the RNS. The second number was generated
at random from the range [0, a) for RUN1, the range [0, a/1000) for RUN2, and the range
[0, a/1000000) for RUN3. In each trial two sets of data were recorded: the absolute error between the
true quotient a/b and the quotient evaluated by the algorithm, and the number of iterations
required by the algorithm. The two tables below give some summary statistics obtained from the
simulations.

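STATISTICS FOR ABSOLUTE ERROR (columns: RUN1, RUN2, RUN3)

[Table only partially legible. Rows: minimum, 25 percentile, 50 percentile, 75 percentile, maximum.
Surviving entries: 1 in each of the first four rows; 4 and 1 in the maximum row.]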

A similar set of simulations was performed using the first division algorithm. As with the present
algorithm, the absolute errors never exceeded one, but there was a higher incidence of errors above
one half. The statistics on the number of iterations show that the first algorithm has a very slow rate
of convergence, viz., 30% of the trials required over 20 iterations.

VIII. CODING METHODS


The  purpose  of  this  hardware  study  was  to  investigate  the  feasibility  of  producing  a  collection  of 
VLSI  chips  which  might  be  used  to  implement  various  digital  signal  processing  algorithms.  A  chip  set 
would  be  made  up  of  the  basic  arithmetic  functions,  as  well  as  some  special  functions.  An  important 
special  function  is  the  operation  of  calculating  the  core.  The  design  and  fabrication  of  a  core 
calculator  chip  was  the  only  way  that  true  feasibility  could  be  demonstrated.  This  effort  was  made 
possible  by  the  advent  of  new  VLSI  computer  aided  design  technology.  With  the  aid  of  an  advanced 
silicon  compiler,  it  was  possible  to  exercise  several  design  options  in  a  relatively  short  period  of  time 
and to implement the final design, all with relatively low cost. The major design considerations
were:  multiple  operations  per  chip,  arithmetic  speed,  the  ability  to  use  relatively  large  moduli,  silicon 
area,  and  chip  delay. 


In the past, nearly all RNS implementations have used read only memories (ROMs) for the basic
arithmetic in a binary coded lookup table format [Huang, 1981; Jenkins, 1977; Jullien, 1980; Polky,
1982; Soderstrand, 1981]. The hardware elements have also been commercially available
components of either TTL, ECL, or NMOS technology. A typical TTL ROM would be organized as a
1K-4K by 8 bit memory, where a single chip package would perform each arithmetic operation. These
devices typically execute their operations on the order of 20-40 nsec (TTL). The propagation delay
can be further reduced by using ECL RAMs (7 nsec); however, the package count increases by eight
times since these devices are only organized as 1K by 1 bit [Huang, 1980].

VLSI devices fall into two typical categories: single operations with special features, or more complex
devices to be used as building blocks in digital signal processing algorithms. For example, Taylor
[1982] conceptually designed a multiplier device for large (48-72 bits) dynamic range operations,
which was composed of a modulo 2^n + 1 adder and a PLA. Other conceptual designs have taken a
similar approach [Jenkins, 1982] for general purpose adders, subtractors, and multipliers, all using a
binary adder and a ROM. Jenkins suggests that in addition to the basic operations, a standard device
could be implemented to perform a mixed radix conversion kernel, a nested polynomial kernel, and
complex arithmetic. More recently, Jullien [1983] has proposed an all memory structure using
multi-look-up table modules for digital filter applications. He discusses the chip area and delay time
for various layouts.


An example of a single function 8-bit RNS multiplier has been presented by Ching and Johnsson
[1983], which was submitted for fabrication by the DARPA MOSIS foundry service. They concluded
that a lookup table format is preferable for moduli coded with less than 4 bits, while an array
multiplier architecture would be better for moduli greater than 5 bits. Their analysis considered
both 4 μm and 0.5 μm feature sizes, with the final design implemented in 4 μm NMOS. Another
more complex chip design was realized by Yeh et al. [1983], where they implemented a 32-point
Fermat number transform for FIR filters.

In all current cases of VLSI implemented RNS devices, the designs have dealt with either creating a
real-time alternative for a time-consuming single function, such as large dynamic range
multiplication, or with some special purpose function such as the number theoretic transform. Our
work has attempted to understand better the potential for VLSI implementation of more general
purpose operations which would be applied specifically to digital signal processing problems. The
core calculator chip would be the foundation of a building block collection of RNS devices. These
components could then be used with the aid of a computer aided design software package to
develop system level modules.

Several coding methods were evaluated with regard to VLSI implementation and their effects on
silicon area and propagation delay through RNS computational devices. Two general classes of
codes for residue representation are redundant and nonredundant codes [Szabo, 1967]. It is
assumed that residue codes for any integrated circuit devices will be of a binary nature, with the
relationships between bits depending on the specific type of code. For this investigation the
nonredundant code was selected to be simply the fixed weight binary code commonly used in all
computer systems. The redundant codes were of two types: 1-of-m position code, and 2-of-m
position code for a modulus of m.

The 1-of-m code is made up of m binary bits with m-1 zeroes and only a single 1 in the k-th position
of the word. Residue operation with this kind of code occurs simply as a permutation of the position
of the nonzero bit in the output codes based on the respective input code positions. In terms of
memory usage, this type of code is inefficient because only m of the 2^m binary states are used.
However, it was thought to have the potential for requiring less control logic than other, more
compact codes.
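The permutation view of 1-of-m arithmetic can be made concrete with a small sketch. The function names are hypothetical and the bit ordering (position 0 first) is an assumption; the point is only that modular addition reduces to a cyclic shift of the single nonzero bit.

```python
def encode_one_hot(k, m):
    """1-of-m code: m bits with a single 1 in position k (position 0 first)."""
    return tuple(1 if i == k else 0 for i in range(m))


def decode_one_hot(code):
    """Position of the single nonzero bit."""
    return code.index(1)


def add_one_hot(x, y):
    """Addition mod m as a cyclic shift of x by the position of the 1 in y."""
    shift = decode_one_hot(y)
    m = len(x)
    return tuple(x[(i - shift) % m] for i in range(m))


three, five = encode_one_hot(3, 7), encode_one_hot(5, 7)
print(add_one_hot(three, five), decode_one_hot(add_one_hot(three, five)))
```

Only 7 of the 2^7 = 128 possible 7-bit words are ever used here, which is the memory inefficiency noted in the text.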


The 2-of-m code differs in that two bits are non-zero rather than one. The primary advantage of this
code is that it requires fewer bits to represent the set of integers in any residue base. The 2-of-m
position code is constructed by using the (m-1)/2 left-most bits to represent half the number range,
with the right-most bit indicating which half. Consider the following example, for modulo 7 arithmetic:


    digit   code

    0       0000
    1       0010
    2       0100
    3       1000
    4       1001
    5       0101
    6       0011


The sets [1,2,3] and [4,5,6] are similar in that their three left-most coded bits have only one non-zero
element, whose position determines the difference between individual numbers. Another
advantage of this code is simple determination of the additive inverse. In each case the additive
inverse corresponds to the number which has the same left-most bit positions (e.g., 1 → [001]
and 6 → [001]). The 2-of-m code has properties similar to a conventional 2's complement code,
where the most significant bit is used to determine whether a number is positive or negative. An
example would be +3 → [0011] and -3 → [1101]. A disadvantage of the 2-of-m code is that its
representation for different moduli is not the same. For example, the integers 4 mod 5 and 4 mod 7
are represented by 0101 and 1100 respectively.
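The additive-inverse property of the modulo 7 example can be checked mechanically. The sketch below uses the code table as read from the example in the text (the entry for digit 2 is taken to be 0100); the dictionary and function names are illustrative.

```python
# Modulo 7 code table as read from the example (digit 2 assumed to be 0100).
CODE = {0: "0000", 1: "0010", 2: "0100", 3: "1000",
        4: "1001", 5: "0101", 6: "0011"}
DIGIT = {bits: digit for digit, bits in CODE.items()}


def negate(bits):
    """Additive inverse mod 7: keep the three left-most bits, flip the last."""
    if bits == "0000":
        return bits                   # zero is its own inverse
    return bits[:3] + ("0" if bits[3] == "1" else "1")


for digit, bits in CODE.items():
    print(digit, bits, negate(bits), DIGIT[negate(bits)])
```

Negation therefore costs a single bit inversion, in the same spirit as the sign bit of a 2's complement word.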


A modulo 7 adder was selected as a model device for comparing these three codes, each
implemented as a programmable logic array (PLA). A PLA represents a compact arrangement of
primitive gates which can accommodate a variety of complex logical operations with efficient use of
silicon area. Figure 4 illustrates the geometric layout of a typical PLA. The cell is composed of rows
and intersecting columns for both the AND array and the OR array, where the rows also connect the
two arrays together. The number of columns is related to the number of input/output terms in the
logic function. A standard PLA cell is composed of an AND matrix and an OR matrix, which are linked
side by side, with the inputs going to the AND array and outputs leaving the OR array. A PLA
implements a canonical set of sum of products combinatorial logic functions of n inputs and m
outputs. The logical structure is similar to that of a read only memory (ROM), except that it does not
cause as many unused states as a standard ROM lookup table might. This occurs because a ROM
usually has a preprogrammed AND array which allows address inputs to logically select a single m-bit
data value.

Fig. 4. Layout drawing for programmable logic array (PLA)

The  Seattle  Silicon  Concorde1"1  silicon  compiler  was  used  to  generate  the  PLA  structures  for  the 
three  coding  techniques.  High  level  logic  equations  were  created  for  each  type  of  addition 
operation  and  were  then  optimized  by  a  software  logic  reduction  routine  to  produce  the  canonical 
sum  of  products  equations.  Figure  5  illustrates  the  logic  reduction  and  optimization  for  a  mod  7 
adder,  where  the  original  statement  of  the  adder  function  is  based  on  a  truth  table.  The  truth  table 
entries  are  in  radix  10,  while  the  final  logic  equations  are  for  a  binary  coded  residue  system.  In  the 
figure,  the  operands  are  A  and  B  with  C  being  the  result.  Logical  operations  of  AND  and  OR  are 
symbolically  represented  by  &  and  #  respectively.  The  logical  complement  is  indicated  by  the  ! 
symbol. This sum of products form of logic equations is easily mapped into the AND/OR physical
structure of a PLA module. The reduced logic equations were entered into the compiler's input stream
for  further  processing.  Based  on  the  corresponding  minterms  for  each  equation  set,  the  compiler 
generated  the  PLA  cells.  A  graphical  display  monitor  was  used  to  investigate  the  geometric  nature 
of  each  PLA  and  to  determine  the  amount  of  silicon  area  which  would  be  occupied  by  each  cell.  The 
compiler  also  generated  additional  information  needed  for  gate  level  simulations  and  timing. 

The resulting silicon areas for the three PLA structures are summarized in Table I. These results
indicate that a binary coded device would use the smallest number of inputs, outputs and columns. The
significance of the number of columns is that it will influence the propagation delay between the
two arrays. The small number of input/output terms is due to the compact nature of conventional
binary coding. However, it was unexpected that the silicon area and the number of PLA columns are
also smaller than for the other code implementations.

The  propagation  delay  was  determined  by  evaluating  the  addition  of  10  random  numbers  in  each 
PLA.  The  average  delays  are  given  in  Table  II.  Again,  the  binary  code  exhibited  the  best 
performance,  which  is  consistent  with  the  general  rule  that  cells  with  the  smallest  area  have  the 
smallest delays. It is estimated from this analysis that for moduli larger than 7, the binary code will
continue  to  be  the  most  efficient  implementation  method  for  VLSI  RNS  components. 


Table I

Sizing of PLA structures for three coding methods

                            1 of m    2 of m    binary
number of input/output        21        12         9
number of columns             49        51        36
area (sq. mil)               943       630       384

Table II

Average propagation delay for the addition of ten numbers.

                                  1 of m    2 of m    binary
propagation time (nanoseconds)     21.8      15.4      13.5


(a) Truth Table:

truth_table ([a,b] -> c)

[0,0]->0  [0,1]->1  [0,2]->2  [0,3]->3  [0,4]->4  [0,5]->5  [0,6]->6
[1,0]->1  [1,1]->2  [1,2]->3  [1,3]->4  [1,4]->5  [1,5]->6  [1,6]->0
[2,0]->2  [2,1]->3  [2,2]->4  [2,3]->5  [2,4]->6  [2,5]->0  [2,6]->1
[3,0]->3  [3,1]->4  [3,2]->5  [3,3]->6  [3,4]->0  [3,5]->1  [3,6]->2
[4,0]->4  [4,1]->5  [4,2]->6  [4,3]->0  [4,4]->1  [4,5]->2  [4,6]->3
[5,0]->5  [5,1]->6  [5,2]->0  [5,3]->1  [5,4]->2  [5,5]->3  [5,6]->4
[6,0]->6  [6,1]->0  [6,2]->1  [6,3]->2  [6,4]->3  [6,5]->4  [6,6]->5



Fig.  5.  Logic  reduction  for  a  mod  7  adder  showing  (a)  the  truth- 
table  source  and  (b)  the  reduced  logic  equations  for  a 
PLA  device. 




(b) Reduced Equations:

c2 = (!b2 & b0 & !a2 & a1 & a0
   # (b2 & b1 & !b0 & a2 & !a1 & a0
   # (!b2 & b1 & b0 & !a2 & a0
   # (b2 & !b1 & b0 & a2 & a1 & !a0
   # (b2 & b1 & !b0 & a2 & a1 & !a0
   # (!b2 & !b1 & !b0 & a2 & !a0
   # (!b2 & b1 & !a2 & a1
   # (b2 & !b1 & !b0 & !a2 & !a0
   # (!b2 & !b0 & a2 & !a1 & !a0
   # (!b2 & !b1 & a2 & !a1
   # (b2 & !b0 & !a2 & !a1 & !a0
   # b2 & !b1 & !a2 & !a1)))))))))));

c1 = (b2 & b1 & !b0 & !a2 & a1 & a0
   # (b2 & !b1 & a2 & !a1 & a0
   # (!b1 & b0 & !a1 & a0
   # (!b2 & b1 & b0 & a2 & a1 & !a0
   # (!b1 & !b0 & a1 & !a0
   # (!b2 & !b1 & !a2 & a1 & !a0
   # (!b2 & !b1 & !b0 & !a2 & a1
   # (b2 & !b1 & b0 & a2 & !a1
   # (b1 & !b0 & !a1 & !a0
   # (!b2 & b1 & !a2 & !a1 & !a0
   # (!b2 & b1 & !b0 & !a2 & !a1
   # !b2 & b1 & b0 & !a2 & a1 & a0)))))))))));

c0 = (b2 & !b1 & b0 & !a2 & a1 & a0
   # (!b2 & b1 & b0 & a2 & !a1 & a0
   # (!b2 & !b0 & !a2 & a0
   # (b1 & !b0 & a2 & a1 & !a0
   # (b2 & b1 & !b0 & a1 & !a0
   # (b2 & !b0 & a2 & !a0
   # (!b2 & b0 & !a2 & !a0
   # (b2 & !b1 & b0 & a2 & !a1 & a0
   # (!b2 & !b1 & !b0 & !a1 & a0
   # (!b1 & !b0 & !a2 & !a1 & a0
   # (!b2 & !b1 & b0 & !a1 & !a0
   # !b1 & b0 & !a2 & !a1 & !a0)))))))))));


Fig.  5  (cont.) 
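The reduced equations of Fig. 5 can be checked mechanically against the truth table. The sketch below is illustrative only: it transcribes the product terms as strings (the string encoding and the helper names `evaluate` and `pla_add` are assumptions of this sketch, not part of the compiler's input format) and compares the resulting 3-bit output with (a + b) mod 7 for all 49 valid operand pairs.

```python
# Product terms of the reduced equations (Fig. 5b); "!" marks a complemented bit.
EQUATIONS = {
    "c2": ["!b2 b0 !a2 a1 a0", "b2 b1 !b0 a2 !a1 a0", "!b2 b1 b0 !a2 a0",
           "b2 !b1 b0 a2 a1 !a0", "b2 b1 !b0 a2 a1 !a0", "!b2 !b1 !b0 a2 !a0",
           "!b2 b1 !a2 a1", "b2 !b1 !b0 !a2 !a0", "!b2 !b0 a2 !a1 !a0",
           "!b2 !b1 a2 !a1", "b2 !b0 !a2 !a1 !a0", "b2 !b1 !a2 !a1"],
    "c1": ["b2 b1 !b0 !a2 a1 a0", "b2 !b1 a2 !a1 a0", "!b1 b0 !a1 a0",
           "!b2 b1 b0 a2 a1 !a0", "!b1 !b0 a1 !a0", "!b2 !b1 !a2 a1 !a0",
           "!b2 !b1 !b0 !a2 a1", "b2 !b1 b0 a2 !a1", "b1 !b0 !a1 !a0",
           "!b2 b1 !a2 !a1 !a0", "!b2 b1 !b0 !a2 !a1", "!b2 b1 b0 !a2 a1 a0"],
    "c0": ["b2 !b1 b0 !a2 a1 a0", "!b2 b1 b0 a2 !a1 a0", "!b2 !b0 !a2 a0",
           "b1 !b0 a2 a1 !a0", "b2 b1 !b0 a1 !a0", "b2 !b0 a2 !a0",
           "!b2 b0 !a2 !a0", "b2 !b1 b0 a2 !a1 a0", "!b2 !b1 !b0 !a1 a0",
           "!b1 !b0 !a2 !a1 a0", "!b2 !b1 b0 !a1 !a0", "!b1 b0 !a2 !a1 !a0"],
}


def evaluate(terms, a, b):
    """Evaluate one sum-of-products output bit for operand values a and b."""
    env = {f"a{k}": (a >> k) & 1 for k in range(3)}
    env.update({f"b{k}": (b >> k) & 1 for k in range(3)})

    def literal(lit):
        # A complemented literal is true when the named bit is 0.
        return env[lit[1:]] == 0 if lit.startswith("!") else env[lit] == 1

    # OR over product terms, AND over the literals of each term.
    return int(any(all(literal(lit) for lit in term.split()) for term in terms))


def pla_add(a, b):
    """Assemble the 3-bit result from the three output equations."""
    return sum(evaluate(EQUATIONS[f"c{k}"], a, b) << k for k in range(3))


# Compare against the truth table for every valid residue pair (0..6).
mismatches = [(a, b) for a in range(7) for b in range(7)
              if pla_add(a, b) != (a + b) % 7]
print("mismatches:", mismatches)
```

The input code 7 is unused in a mod 7 residue channel, so the check deliberately skips it; the reduced equations are free to treat it as a don't-care.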


Floor Plan of VLSI Chip


The arrangement of cells in the core chip corresponds closely with the arithmetic elements of the
core function, fundamentally an inner product. PLA cells were used to effect the multiplication
operation, while 5-bit binary adders were used for a chain summation. In RNS terms the chip
accepts a residue representation $a = (a_1, a_2, a_3, a_4, a_5, a_6)$, where the moduli set is
{23,25,27,29,31,32}. The chip will execute the function,

$$C(a) = \Big| \sum_i s_i a_i \Big|_{m_{n+1}},$$

where the terms $s_i$ are the fixed coefficients {7,24,6,20,30,9}.


It was determined that a modulus of 32 would be an acceptable choice not only for reasons related
to the core function, but by selecting a power of two, the basic arithmetic elements on the chip could
be built from conventional 5-bit binary adders and PLA's. Five bit binary adders were selected for
the mod 32 operations, while PLA's were selected to perform the fixed coefficient multiplies. The accompanying figure
illustrates the core chip floor plan. Six residue values enter the chip through PLA's wired to
perform the set of fixed multiplications, followed by binary adders which accumulate the totals by
chain summation (mod 32), with the resulting core value produced at the chip outputs.


Design 


»  core  calculator  chip  design  two  standard  compiler  components  were  used:  the  PLA  and  the 
inary  adder.  The  PLA's  are  programmed  by  their  geometric  interconnections  to  act  essentially 
aders.  In  more  conventional  terms,  the  residue  values  represent  a  code  which  is  used  to  select 
M  possible  output  states.  For  example,  the  mod  23  PLA  would  produce  the  following  results: 


input = 5     output = 3 mod 32
input = 9     output = 31 mod 32.


0 


otherwise, 


$c_i = c_{i-1} + \varepsilon_i a_i.$

Delta multiplication. Delta multiplication assumes that the RNS can be subdivided into two
subsystems with ranges $M_1$ and $M_2$, where $M = M_1 M_2$ and $M_1 \approx M_2 \approx \sqrt{M}$. The value of c is
approximated by the product of the integer parts of $a/M_1$ and $b/M_2$, and adjusted by the residues of a
and b modulo $M_1$ and $M_2$, respectively.

Let $a^* = \lfloor a/M_1 \rfloor$, $b^* = \lfloor b/M_2 \rfloor$, $r(a) = |a|_{M_1}$, $r(b) = |b|_{M_2}$. Then $a = a^* M_1 + r(a)$, $b = b^* M_2 + r(b)$,
and $c' = ab/M = a^* b^* + b^* r(a)/M_1 + a^* r(b)/M_2 + r(a) r(b)/M$. In the last expression
the product $a^* b^*$ is directly computable in the RNS since $a^* < M_2$ and $b^* < M_1$. The next two terms
involve divisions by $M_1$ and $M_2$ respectively. Defining $\Delta(a) = \lfloor r(a) b^* / M_1 \rfloor$, $\Delta(b) = \lfloor r(b) a^* / M_2 \rfloor$,
$s(a) = |r(a) b^*|_{M_1}$, $s(b) = |r(b) a^*|_{M_2}$, it follows that $c' = a^* b^* + \Delta(a) + \Delta(b) + d$, where

$$d = \frac{r(a)\, r(b)}{M} + \frac{s(a)}{M_1} + \frac{s(b)}{M_2}.$$

The value of $c = \lfloor c' \rfloor$ is evaluated as $a^* b^* + \Delta(a) + \Delta(b)$ with an error which does not exceed 3.

The residue representations of r(a) and $a^*$ are obtained in two steps. First the representations
in the subsystem of range $M_1$ are evaluated. The complete residue representations are then
obtained by using the extension of base theorem. The subsystem representation of r(a) follows
immediately since it is identical to that of a. That of $a^*$ is obtained by dividing a - r(a) by $M_1$ in the

If e denotes the product a.b (mod M), then a.b = c.M + e, and C(a.b) = c.C(M) + C(e). If $C(M) \ne 0$,
then using (A.1), it follows that

$$c = \Big\{ b\,C(a) + \sum_i w_i a_i \frac{b - b_i}{m_i} + \sum_i w_i \Big\lfloor \frac{a_i b_i}{m_i} \Big\rfloor - C(e) \Big\} \Big/ C(M).$$

The value of c can then be evaluated exactly in the RNS if M and C(M) are relatively prime. This
algorithm is exact; nevertheless, for a system with n moduli it requires several core evaluations, n
scalings (Theorem 9), and n table look-up operations.


Binary Multiplication. The binary algorithm uses the binary expansion of the quotient b/M, and
evaluates c as

$$c = \sum_{i=1}^{k} \varepsilon_i \lfloor a/2^i \rfloor, \qquad \text{(A.2)}$$

where $k = \lfloor \log_2 (M-1) \rfloor$, and the bits $\varepsilon_i$ are the coefficients in the binary expansion of b/M. The error
in using (1) to evaluate c is


e.  — — 

*  2' 

that is bounded above by $k - 1 + (M/2^k)$. The implementation of this algorithm consists of the
evaluation of $\lfloor a/2^i \rfloor$, the evaluation of the bits $\varepsilon_i$, and the sum in the right hand side of (A.2). The first
requires at most k parity checks and scalings by 2. The binary expansion of b/M is attained by
comparisons of M with the product of b with successive powers of 2. The algorithm starts with
initial values $a_0 = a$, $b_0 = b$, $\varepsilon_0 = c_0 = 0$, and then uses the recursive formulas given below. The process
terminates when $a_i = 0$, which will occur for some $i \le k$.


$d_i$ = parity of $a_{i-1}$;

$a_i = \lfloor a_{i-1}/2 \rfloor = (a_{i-1} - d_i)/2$


APPENDIX. FLOATING POINT ARITHMETIC


This section presents the application of the notion of core function to mantissa operations in floating
point arithmetic. Akushskii et al. [1], [2] discuss three multiplication algorithms and one division
algorithm: core, binary, and delta multiplication, and binary division. All four algorithms use the
core function, but whereas core multiplication is based on a formula for the core of a product of two
integers, the other algorithms use the core indirectly to perform comparisons, extensions of base,
and scalings.

In floating point arithmetic, each number is represented by an exponent and a mantissa. The
exponent, an integer, can be represented in the usual way as a power of 2. The mantissa, a number
between 0 and 1, is represented by the element of the RNS whose quotient to the range of the
system is equal to the mantissa. More precisely, a number $x \cdot 2^k$, 0 < x < 1, is represented by the pair
(a,k) where 0 < a < M, and [x.M] = a.

Let us consider two integers a and b in [0,M), which represent the mantissas a/M and b/M
respectively. The product mantissa (c'/M) = (a/M).(b/M) is represented by the integer part c = [c'] =
[a.b/M]. A general description of each of the algorithms used for evaluating the product c is given
below. All three approaches are then applied to an example, which shows in detail how each
algorithm is implemented in residue arithmetic with core functions. Since the same example is used,
this permits us to compare the algorithms in terms of complexity and speed.

Core Multiplication. The core multiplication algorithm is based on a formula for the core of a
product of two integers:


be made available for laboratory evaluation. After the performance limitation of these devices is
determined, the resulting insights will be applied for further evaluation of speed enhancements,
smaller line widths (e.g., 1 μm), and more input/output pins.

In addition to the specific research discussed above, a more general effort will be pursued to
investigate similar design techniques applied to devices for division and other floating point
operations. A computer-aided engineering system will be used to simulate the performance of
various RNS devices. Software simulation tools will allow a hierarchical approach to the overall
research effort and thereby provide greater insight into the high level functional behavior of
potential subsystems for a real-time signal processing system.

Speed and hardware requirements should be evaluated for specific architectures using cores and/or
the fractional RNS for realistic problems. Simulations of critical subprocesses using a computer aided
design engineering workstation should be performed. Specific algorithms of interest to the Ocean
Surveillance Signal Processing program should be determined as benchmark algorithms for the
evaluation process. Given an algorithm, an RNS computational architecture could then be
determined and the functional layout of the integrated circuits and the data flow paths would be
specified. Execution times and hardware requirements could then be evaluated and compared with
those predicted using a more conventional architecture.


The  current  algorithms  of  Gregory  and  Matula  are  computationally  complex  and  require  high- 
precision  calculations  in  a  weighted  number  system,  in  part  defeating  the  primary  benefit  of  using 
an  RNS.  Decoding  procedures  performed  within  a  usual  (not  fractional)  RNS  itself  will  be 
investigated.  One  approach  would  be  to  recast  Gregory  and  Matula's  algorithm  into  the  RNS; 
however, the difficulty here is the requirement for integer division in repeated applications of the
Euclidean algorithm. If a core could be derived which is "nearly linear", then the integer divide
might  be  efficiently  performed  using  the  core  to  make  successively  more  accurate  estimates  of  the 
quotient. 

Gregory and Matula have also developed an as yet unpublished alternate fractional RNS. Instead of
using n-tuples of ordered pairs to represent a fraction in an n-modulus RNS, n-tuples of integers are
used together with a formal symbol. The formal symbol will appear in the kth position of the
representation of a fraction if the denominator is divisible by the kth modulus. Arithmetic
operations  are  then  defined  on  the  residues  together  with  this  formal  symbol.  This  system  will  be 
examined  in  detail  for  application  to  signal  processing  problems,  and  new  algorithms  to  alleviate  the 
decoding  bottleneck  will  also  be  investigated. 

Having  developed  more  efficient  approaches  for  the  operations  of  decoding  and  magnitude 
comparison,  the  computational  complexity  of  the  fractional  RNS  on  realistic  problems  will  be 
evaluated.  Comparisons  with  other  RNS  and  more  conventional  architectures  will  be  performed. 

Research  should  continue  toward  investigation  of  VLSI  techniques  applied  to  special  purpose  RNS 
devices.  Further  analysis  for  functional  performance  and  execution  speed  of  a  core  calculation  chip  is 
required.  These  efforts  could  be  performed  with  the  aid  of  a  silicon  compiler  CAD  system.  By  using 
such  a  method,  considerable  insight  can  be  gained  regarding  the  detailed  physical  layout  and  logical 
performance  of  a  conceptual  RNS  processor  chip.  By  taking  advantage  of  a  low  cost  shared-chip 
fabrication  process,  a  small  number  of  devices  can  be  produced  for  further  research  of  real  chip 
performance.  It  is  expected  that  both  a  core  calculation  chip  and  a  multi-modulus  adder  chip  would 


Many  other  applications  of  the  core  function  may  exist,  and  new  and  even  simpler  algorithms  for  the 
difficult  residue  operations  may  be  possible.  First,  analysis  of  convergence  properties  of  the  new 
division  algorithms  should  be  performed.  It  is  expected  that  further  refinements  of  the  algorithms 
will  yield  better  convergence  properties.  Second,  new  overflow  detection  methods  for  both 
addition  and  multiplication  are  probably  possible  for  suitably  linear  core  functions.  Next,  analysis  of 
appropriate  integer  programming  techniques  for  computation  of  optimal  core  functions  is  required. 
Improvement  of  the  floating  point  division  and  multiplication  algorithms  of  Akushskii  et  al.  should 
be  investigated.  Possibly,  core  functions  could  be  utilized  to  attain  a  unique  representation  of  a 
floating  point  number.  Next,  the  analysis  of  the  Soviet  RNS  literature  for  this  project  has  been  so 
fruitful  that  further  efforts  in  this  direction  are  required.  Finally,  a  variety  of  other  uses  for  the  core 
will undoubtedly arise; for example, core functions may be applied to fault tolerant systems.

In  the  course  of  the  current  project,  the  fractional  RNS  concepts  of  Gregory  et  al.  [11]  were  surveyed 
for  their  application  to  real  time  signal  processing.  In  a  fractional  RNS  a  mapping  is  defined  from  a 
subset  of  the  rationals  to  a  residue  system  which  permits  divisions  as  fast  as  multiplications  and 
additions.  The  work  was  developed  for  exact  arithmetic  in  poorly  conditioned  linear  algebra 
computations.  Consequently,  it  was  implemented  on  a  general  purpose  computer  using  extremely 
large  moduli  and  an  extremely  large  dynamic  range  (on  the  order  of  hundreds  of  bits).  This  work  is 
not  only  mathematically  elegant,  but  it  may  have  great  potential  for  the  smaller  dynamic  range 
problems  found  in  signal  processing.  In  this  setting,  encoding  into  the  system  and  all  the  elementary 
arithmetic  operations  (including  division)  are  efficient  and  can  be  implemented  in  a  single  clock  cycle 
using  custom  or  semi-custom  hardware.  The  operations  of  magnitude  comparison  and  decoding 
back  to  the  fractional  representation  are  currently  quite  difficult,  and  the  development  of  new 
algorithms  for  these  operations  is  expected  to  permit  the  application  of  the  fractional  RNS  to 
complex  signal  processing  problems. 


IX.  CONCLUSION  AND  FUTURE  WORK 


This  paper  has  presented  the  core  function  of  Akushskii,  Burcev,  and  Pak  and  has  extended  their 
theoretical  results  to  render  the  core  function  a  practical  solution  to  the  difficulties  of  residue 
arithmetic.  Specifically,  this  paper  has  introduced  the  use  of  a  redundant  modulus  for  efficient  core 
calculation  and  has  presented  algorithms  for  all  of  the  traditionally  difficult  residue  operations 
including  comparison  and  general  division.  In  addition,  new  results  were  presented  which 
reformulate the problem of selecting the core coefficients $w_i$ as an integer programming problem,
so  that  optimal  cores  may  be  obtained.  High  speed  VLSI  circuits  for  implementing  the  core  function 
and  other  multi-stage  RNS  computations  have  been  judged  feasible. 

In the past, the residue number system has been well matched to the performance of linear signal
processing  algorithms  for  which  very-high  speed  execution  was  critical.  With  the  results  presented  in 
this  work,  it  now  appears  that  the  RNS  may  be  suited  for  high  speed  performance  of  nonlinear 
arithmetic  and  complex  logical  operations.  These  efforts  have  been  very  positive,  both  in  the 
solutions  of  some  theoretical  problems  and  in  leading  to  physically  realizable  RNS  core  calculation 
integrated  circuits.  By  coupling  the  parallelism  of  the  RNS  with  the  positional  capabilities  of  the  core 
function,  RNS  computational  architectures  offer  high  potential  for  next  generation  processors  for 
both  linear  and  non-linear  real  time  processing. 

Further research is required before this potential can be realized for practical data processing
systems.  Four  areas  have  been  identified  for  continued  investigation.  These  include  (i)  further 
analysis  of  the  applications  of  the  core  function;  (ii)  investigation  of  fractional  RNS  systems;  (iii) 
conceptual  design  and  analysis  of  RNS  VLSI  devices;  and  (iv)  development  to  facilitate  applications  of 
this  technology  to  signal  processing. 

Complete layout drawing for the core calculator chip showing PLA
cells, binary adder cells and the pad ring.


The PLA outputs feed five 5-bit binary adders, whose outputs are then connected to the inputs of the
next stage. Since all of the arithmetic is done mod 32, this chained arrangement facilitates
data  flow.  Figure  7  represents  the  silicon  design  for  the  adders,  where  the  nature  of  the  geometric 
patterns  is  similar  to  the  previously  described  PLA  cell. 

Figure  8  illustrates  the  working  area  of  the  core  calculator,  whose  dimension  is  70  x  100  mils.  (This  is 
smaller  than  a  single  cell  for  a  1-of-m  or  2-of-m  position  code).  Figure  9  shows  the  complete  chip, 
including  an  input/output  pad  ring,  30  inputs,  5  outputs  and  one  power/ground  pair,  with  a  total 
area  of  155  x  200  mils. 

Simulation 

The core calculator design was tested using RNL, a timing logic simulator for digital circuits, which
uses a resistive model of transistors to implement logic level simulation. This simulation
completely checks the functionality of the circuitry, and previous research has determined it to be
accurate to within 20 percent of the performance measured for fabricated devices.

The  entire  core  chip  was  evaluated  by  processing  a  set  of  test  vectors  which  were  selected  as  the 
numbers: 

1835870  10545965  6401369  7435820  3056008  9474898 

13045486  5352529  11597763. 

This  set  was  encoded  by  each  modulus  to  produce  the  input  residues  for  the  chip  input  pins,  and  the 
resulting  core  values  from  the  chip  simulation  were  compared  with  the  analytical  values.  A 
successful  match  between  these  results  was  used  as  a  strong  indication  that  the  chip  would  function 
properly  after  fabrication.  A  sample  calculation  would  be  similar  to  the  following: 
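As a minimal software sketch of such a check, the chip can be modelled as six lookup tables followed by a chain of 5-bit additions. The coefficients {7,24,6,20,30,9} and the modulus 32 are taken from the text; the first entry of the moduli set (23) is an assumption inferred from the "mod 23 PLA" example, and the names `PLAS` and `chip_core` are hypothetical.

```python
# Moduli set of the core chip; the leading 23 is an assumption (see lead-in).
MODULI = (23, 25, 27, 29, 31, 32)
# Fixed coefficients s_i implemented by the six PLAs.
COEFFS = (7, 24, 6, 20, 30, 9)

# Each PLA is a lookup table: residue r -> (s_i * r) mod 32.
PLAS = [[(s * r) % 32 for r in range(m)] for m, s in zip(MODULI, COEFFS)]


def chip_core(n):
    """Model of the chip: PLA lookups accumulated by a chain of 5-bit adders."""
    total = 0
    for m, table in zip(MODULI, PLAS):
        # A 5-bit adder discards the carry out, which is addition mod 32.
        total = (total + table[n % m]) & 0b11111
    return total


# Test vectors listed in the text.
TEST_VECTORS = [1835870, 10545965, 6401369, 7435820, 3056008,
                9474898, 13045486, 5352529, 11597763]

for n in TEST_VECTORS:
    residues = [n % m for m in MODULI]
    print(n, residues, chip_core(n))
```

The table entries reproduce the PLA examples quoted in the text: the mod 23 table maps 5 to 3 and 9 to 31, and the mod 31 table maps 2 to 28.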


The silicon representation of the standard PLA cell is shown in Fig. 4. The design shows the various
layers (metal, polysilicon, diffusion, etc.) of the CMOS p-well fabrication process. As shown in the
figure, the PLA is surrounded by Vdd and GND metal traces. The inputs to the PLA occur at the five
contacts located in the lower left hand corner. The input signals proceed north through buffers
where the drive capacity and propagation delays of the signal can be controlled.

The signals exit the buffers and enter the plane of the PLA, which consists of minterm rows
representing  the  binary  form  of  the  input  vector.  A  modulo  31  PLA  has  31  rows  of  minterms, 
representing  inputs  of  0-30.  Physically  these  minterms  are  a  series  of  transistors  that  form  logic  gates 
corresponding  to  the  PLA  code  file.  For  example,  the  code  file  for  row  2  of  this  PLA  is 


00010


which  is  the  binary  representation  of  the  number  2.  Silicon  transistors  are  formed  in  row  2  such  that 
if  the  PLA  input  is  equal  to  2,  then  the  output  will  correspond  to  the  value  stored  on  the  right  hand 
side  of  row  2,  which  forms  the  OR  plane  of  the  PLA. 


Thus  the  right  hand  side  of  row  2  has  transistors  which  represent  the  binary  code  for  28,  or 


11100. 


The  result  of  an  input  signal  equal  to  2  produces  an  output  signal  equal  to  28,  which  is  the  correct 
response  for  this  PLA.  The  electronic  input  signals  pass  from  the  input  contacts  through  the  proper 
minterm  and  finally  through  the  output  buffers  to  the  output  contacts  for  wiring  connections  to 
other  parts  of  the  core  calculator. 


The PLA in Figure 4 multiplies the input signal by 30 and then outputs the product modulo 32. For
the above example, an input of 2 will yield an output of: 2 × 30 ≡ 28 (mod 32).

subsystem of range $M_2$ and then applying extension of base to the subsystem $M_1$. Similar steps are
used to evaluate the rest of the terms.


Example 9. Consider an RNS with moduli {3,5,7,11}. The core function has weights {-1,2,-1,1}, and
hence C(M) = C(1155) = 17. The orthogonal basis elements are $B_1 = 385$, $B_2 = 231$, $B_3 = 330$ and
$B_4 = 210$, with cores $C(B_1) = 6$, $C(B_2) = 3$, $C(B_3) = 5$, and $C(B_4) = 3$. The core has a minimum of -2
and a maximum of 18, and so 0, 1, 15, and 16 are critical cores.

In the present example the multiplicands are a = (0,1,5,8) with core C(a) = 1, and b = (2,4,6,9) with
core C(b) = 13. The value of c = [a.b/M] is evaluated using the three algorithms described above. The
decoded values of the operands and quotients as well as the errors are discussed following the
example.
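The constants of this example can be reproduced with a few lines of ordinary integer arithmetic. The sketch below assumes the core has the form $C(n) = \sum_i w_i \lfloor n/m_i \rfloor$, which is the form used for the subsystem cores in the delta multiplication example; the function names are illustrative, and the operand values 96 and 944 are the decoded values quoted after the example.

```python
from math import prod

MODULI = (3, 5, 7, 11)
WEIGHTS = (-1, 2, -1, 1)
M = prod(MODULI)  # range of the RNS


def core(n):
    """Core function, assumed to be sum of w_i * floor(n / m_i)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


def residues(n):
    """Residue representation of n with respect to MODULI."""
    return tuple(n % m for m in MODULI)


def orthogonal_basis():
    """B_i is congruent to 1 mod m_i and to 0 mod every other modulus."""
    basis = []
    for m in MODULI:
        partial = M // m
        # pow(x, -1, m) is the modular inverse (Python 3.8+).
        basis.append(partial * pow(partial, -1, m))
    return basis


print("M =", M, " C(M) =", core(M))
print("basis:", orthogonal_basis())
print("basis cores:", [core(B) for B in orthogonal_basis()])
print("a:", residues(96), core(96))
print("b:", residues(944), core(944))
```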


(i) Core Multiplication.

When formula (*) is applied several of the terms are easily computed. First $b\,C(a) = (2,4,6,9) \times 1 = (2,4,6,9)$.
The third term may be computed using a table look-up to evaluate each of the summands
$w_i \lfloor a_i b_i / m_i \rfloor$, and then adding to obtain (2,2,2,2). The remainder $e = |ab|_M = (0,4,2,6)$ has core C(e)
= 6, and so -C(e) = (0,4,1,5). The division by C(M) = 17 is implemented by multiplying by its
reciprocal (1/17) = (2,3,5,2), which is a system constant. The first sum is the only remaining term and
is the most demanding. It requires one scaling of b per modulus, e.g., for the first modulus
$b - b_1 = (0,2,4,7)$ with core $C(b - b_1) = 13$. The residue representation of $(b - b_1)/m_1$ in the subsystem of
moduli {5,7,11} is

$$\frac{b - b_1}{m_1} = \frac{(2,4,7)}{(3,3,3)} = (2,4,7) \times (2,5,4) = (4,6,6).$$

Using Theorem 9 to scale b by $m_1$ it follows that the remainder of $(b - b_1)/m_1$ with respect to $m_1$ is
2. Repeating this process for each of the rest of the moduli one obtains $(b - b_2)/m_2 = (2,3,6,1)$,
$(b - b_3)/m_3 = (2,4,1,2)$, $(b - b_4)/m_4 = (1,0,1,8)$, and so

$$\sum_i w_i a_i \frac{b - b_i}{m_i} = (2,1,1,1).$$

Formula (A.1) is now applied to calculate

$$c = \{(2,4,6,9) + (2,1,1,1) + (2,2,2,2) + (0,4,1,5)\} \times (2,3,5,2) = (0,1,3,6) \times (2,3,5,2) = (0,3,1,1).$$
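The same calculation can be followed term by term on ordinary integers. This is only a sketch: it works on decoded integers rather than on residue digits, so it illustrates the formula and not the RNS data path, and it assumes the core has the form $C(n) = \sum_i w_i \lfloor n/m_i \rfloor$. The name `core_multiply` is hypothetical.

```python
from math import prod

MODULI = (3, 5, 7, 11)
WEIGHTS = (-1, 2, -1, 1)
M = prod(MODULI)


def core(n):
    """Core function, assumed to be sum of w_i * floor(n / m_i)."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


def core_multiply(a, b):
    """Evaluate c = floor(a*b / M) from the core of the product."""
    pairs = list(zip(WEIGHTS, MODULI))
    term1 = b * core(a)                                               # b * C(a)
    term2 = sum(w * (a % m) * ((b - b % m) // m) for w, m in pairs)   # scalings of b
    term3 = sum(w * (((a % m) * (b % m)) // m) for w, m in pairs)     # table look-ups
    e = (a * b) % M                                                   # remainder
    numerator = term1 + term2 + term3 - core(e)
    c, rem = divmod(numerator, core(M))
    assert rem == 0, "numerator should be an exact multiple of C(M)"
    return c


print(core_multiply(96, 944))  # 78
```

For a = 96 and b = 944 the four terms are 944, 386, 2 and -6, whose residue representations are the four vectors summed in the example, and 1326 / 17 = 78.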


(ii) Binary Multiplication

In this example $k = \lfloor \log_2 (M-1) \rfloor = 10$, so the algorithm will take at most 10 steps. The results of
applying the iterative steps of the algorithm are shown in Table 1. The parity $d_i$ of $a_{i-1}$ is
determined using Theorem 4, which requires the evaluation of the core of $a_i$ and the parity of its
residue components. The bit $\varepsilon_i$ is determined by means of Theorem 7, which requires two core
calculations and a comparison. In Table 1 six iterations are shown since $a_7 = 0$. The last entry for the
$c_i$ column gives c = (0,3,1,1).


TABLE 1

 i   d_i   a_i         b_i         ε_i   c_i
 0         (0,1,5,8)   (2,4,6,9)   0     (0,0,0,0)
 1   0     (0,3,6,4)   (1,3,5,7)   1     (0,3,6,4)
 2   0     (0,4,3,2)   (2,1,3,3)   1     (0,2,2,6)
 3   0     (0,2,5,1)   (1,2,6,6)   0     (0,2,2,6)
 4   0     (0,1,6,6)   (2,4,5,1)   1     (0,3,1,1)
 5   0     (0,3,3,3)   (1,3,3,2)   0     (0,3,1,1)
 6   1     (1,1,1,1)   (2,1,6,4)   0     (0,3,1,1)
 7         0           -           -     -
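The iteration behind Table 1 can be sketched on ordinary integers. The halving step and the accumulation follow the recursion stated for the algorithm; the rule used here for the bit, $\varepsilon_i = 1$ when $2 b_{i-1} \ge M$, is an assumption inferred from the $b_i$ column of the table, and in the RNS the parity and comparison steps would be carried out with core functions rather than directly. The name `binary_multiply` is hypothetical.

```python
def binary_multiply(a, b, M):
    """Approximate floor(a*b / M) as the sum of eps_i * floor(a / 2**i)."""
    c = 0
    rows = []
    i = 0
    while a > 0:
        i += 1
        d = a & 1                       # parity of a_{i-1}
        a = (a - d) // 2                # a_i = (a_{i-1} - d_i) / 2
        doubled = 2 * b
        eps = 1 if doubled >= M else 0  # next bit of the expansion of b / M (assumed rule)
        b = doubled - eps * M           # b_i = 2 * b_{i-1} reduced mod M
        c += eps * a                    # c_i = c_{i-1} + eps_i * a_i
        rows.append((i, d, a, b, eps, c))
    return c, rows


MODULI = (3, 5, 7, 11)
result, rows = binary_multiply(96, 944, 1155)
for i, d, a, b, eps, c in rows:
    print(i, d, tuple(a % m for m in MODULI), tuple(b % m for m in MODULI),
          eps, tuple(c % m for m in MODULI))
print("c =", result)
```

Unlike the table, this sketch keeps updating b, the bit, and c on the final step, where the table shows dashes; since $a_7 = 0$ that step does not change c.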

(iii) Delta Multiplication

Since M = 1155, the complete system is subdivided into the subsystems {3,11} with range
$M_1 = 33$, and {5,7} with range $M_2 = 35$. For extension of base calculations core functions are defined
on each of the two subsystems: $C_1(a) = \lfloor a/3 \rfloor - \lfloor a/11 \rfloor$, and $C_2(a) = -\lfloor a/5 \rfloor + 3\lfloor a/7 \rfloor$.

Since a = (0,1,5,8) it follows that r(a) = (0,8) in the first subsystem. Using the extension of base
theorem and $C_1(r(a)) = 8$, one obtains that r(a) = (0,0,2,8) in the complete system. The value of $a^*$ is
first evaluated modulo $M_2$ in the second subsystem,

$$a^* = \frac{a - r(a)}{M_1} = \frac{(1,5) - (0,2)}{(3,5)} = (1,3) \times (2,3) = (2,2),$$

and then since $C_2(a^*) = 0$, by extension of base, obtain $a^* = (2,2,2,2)$.

Similar calculations lead to r(b) = (1,4,6,1), $b^* = (2,1,5,4)$, $\Delta(a) = (2,3,2,1)$ and $\Delta(b) = (1,1,1,1)$.
From these c is evaluated as c = (2,2,2,2) × (2,1,5,4) + (2,3,2,1) + (1,1,1,1) = (1,1,6,10).
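A plain-integer sketch of the delta estimate makes the intermediate quantities easy to inspect. It uses ordinary division in place of the scaling and extension of base steps that the RNS implementation requires, so it illustrates the arithmetic only; the name `delta_multiply` is hypothetical.

```python
def delta_multiply(a, b, M1, M2):
    """Estimate floor(a*b / (M1*M2)) as a*.b* + Delta(a) + Delta(b)."""
    a_star, r_a = divmod(a, M1)       # a = a* . M1 + r(a)
    b_star, r_b = divmod(b, M2)       # b = b* . M2 + r(b)
    delta_a = (r_a * b_star) // M1    # Delta(a) = floor(r(a) . b* / M1)
    delta_b = (r_b * a_star) // M2    # Delta(b) = floor(r(b) . a* / M2)
    return a_star * b_star + delta_a + delta_b


M1, M2 = 33, 35
a, b = 96, 944
estimate = delta_multiply(a, b, M1, M2)
exact = (a * b) // (M1 * M2)
print(estimate, exact)  # 76 78
```

Here $a^* = 2$, $r(a) = 30$, $b^* = 26$, $r(b) = 34$, $\Delta(a) = 23$ and $\Delta(b) = 1$, whose residue representations are the vectors quoted above. Because the discarded quantity d is a sum of three terms that are each smaller than 1, the estimate never exceeds the exact value and falls short of it by at most 2.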


The first two algorithms evaluate c = (0,3,1,1), which corresponds to the integer 78. In fact since
a = 96 and b = 944, the exact value of [c'] is 78. Delta multiplication evaluates c = (1,1,6,10) which
corresponds to 76, i.e., an error of 2 units.


The core algorithm is exact but it is the most computationally intensive. The delta algorithm requires
eight extensions of base, independently of the size of the moduli set, and assures a maximum error
of less than 3. The binary algorithm is in general the simplest and the fastest to converge, but allows
the largest error; e.g., if in the same RNS one considers the case where a = 1023 and b = 1154,
then c = [a.b/M] = 1022, but the binary algorithm calculates c = 1013, i.e., an error of 9.


Division  Algorithm.  The  binary  division  algorithm  for  floating  point  arithmetic  is  similar  to  that  for 
binary  multiplication.  It  combines  a  binary  expansion  of  a  number  in  [0,1)  with  the  quotients  and 
remainders  obtained  in  dividing  the  range  of  the  RNS  by  successive  powers  of  two. 


Let  a  and  b  be  two  integers  in  [ 0,M ],  a  <  b,  which  represent  the  mantissas  a!M  and  b/M 
respectively.  The  quotient  mantissa  (q'  /  M)  =  (a  /  M)  /  (b  /  M)  is  represented  by  q  in  [0,M),  where 
q  =[M.a/b). 


The  value  of  g'can  be  evaluated  as  the  sum  of  the  terms: 


i.n.'  l.V"  '.'-V-V  V’-' '™'! I'l '.I  i  «■ 


«%  =  I  (C,/2')^, 


1=1 


E.  =  M.  y  e./2‘ , 
i>k 


where the e_i are the bits in the binary expansion of a/b, i.e.,

$$(a/b) = \sum_{i=1}^{\infty} e_i / 2^i;$$

k = [log2 M]; and T_i and S_i are the quotients and remainders obtained when dividing M by
successive powers of 2, i.e., T_i = [M/2^i], S_i = |M|_{2^i}.
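As a small illustration of these constants, the sketch below tabulates T_i and S_i and checks the identity M = T_i.2^i + S_i. It uses M = 693, the product of the moduli 7, 9 and 11 of Example 10, and reads the brackets as the integer part; the function name `power_of_two_constants` is hypothetical.

```python
def power_of_two_constants(M):
    """Return k = floor(log2 M) with the tables T_i = M // 2^i and S_i = M mod 2^i."""
    k = M.bit_length() - 1
    T = {i: M >> i for i in range(1, k + 1)}
    S = {i: M & ((1 << i) - 1) for i in range(1, k + 1)}
    return k, T, S


M = 7 * 9 * 11  # 693
k, T, S = power_of_two_constants(M)

for i in range(1, k + 1):
    # Quotient and remainder must recombine to M, with S_i < 2^i.
    assert M == T[i] * 2**i + S[i] and 0 <= S[i] < 2**i
    print(i, T[i], S[i])
```

Since these values depend only on M, they can be computed once and stored in residue form.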


The bits e_i, as shown below, can be evaluated within the RNS by successive multiplications by 2 and
comparisons with b. The sum Q_1 only involves the bits and the system constants T_i, and so it is also
computable in residue arithmetic. Applying this direct approach to the sum Q_2^* proves not useful,
since S_i < 2^i and hence the sum would be computed as Q_2^* = 0. A more accurate procedure is
obtained by first evaluating this sum with M.S_i instead of S_i, and then dividing the result by M, i.e.,
Q_2^* is rewritten as the sum of

$$Q_2 = \frac{1}{M} \sum_{i=1}^{k} e_i U_i, \qquad E_2 = \frac{1}{M} \sum_{i=1}^{k} e_i V_i / 2^i,$$

where U_i = [M.S_i/2^i], and V_i = |M.S_i|_{2^i}.


The value of q = [M.a/b] is evaluated as the sum Q_1 + [Q_2], where E_1, E_2, and E_3 contribute to
define the error in the evaluation of q. This error can be shown not to exceed

$$1 + (M / 2^k) + (k - 1 + 2^{-k}) / M < 4.$$
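The estimate Q_1 + [Q_2] and its error can be explored numerically. The sketch below is a plain-integer model, not a residue implementation: it obtains each bit e_i directly as the i-th binary digit of a/b, and the function name `approx_quotient` is an assumption made for illustration. The operands a = 87 and b = 371 are the integers whose residues modulo (7, 9, 11) are (3,6,10) and (0,2,8), the values used in Example 10.

```python
def approx_quotient(a, b, M):
    """Estimate floor(M*a/b) as Q_1 + floor(Q_2), for 0 <= a < b."""
    k = M.bit_length() - 1              # k = floor(log2 M)
    q1 = 0                              # Q_1 = sum of e_i * T_i
    u_sum = 0                           # M * Q_2 = sum of e_i * U_i
    for i in range(1, k + 1):
        e_i = ((a << i) // b) & 1       # i-th bit of the binary expansion of a/b
        t_i = M >> i                    # T_i = floor(M / 2^i)
        s_i = M % (1 << i)              # S_i = M mod 2^i
        u_i = (M * s_i) >> i            # U_i = floor(M * S_i / 2^i)
        q1 += e_i * t_i
        u_sum += e_i * u_i
    return q1 + u_sum // M


M = 693
a, b = 87, 371
exact = (M * a) // b
estimate = approx_quotient(a, b, M)
print(exact, estimate, exact - estimate)
```

Because every discarded term is non-negative, the estimate never exceeds the exact quotient, and the bound above limits the shortfall to at most 3 units.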

The  algorithm  presented  by  Akushskii  et  al.  can  be  divided  into  three  parts. 

(1) Evaluation of the system constants (precomputed):

k = [log2 M],

T_i = [M/2^i], i = 1, ..., k,

U_i = [M.|M|_{2^i}/2^i], i = 1, ..., k,

C(U_i) = core of U_i.


(2) Evaluation of the binary expansion of a/b. The following recursive scheme is used: starting
with initial values of a_0 = a and e_0 = 0, the recursive step is

$$a_i = 2 (a_{i-1} - b \cdot e_{i-1}), \qquad e_i = \begin{cases} 1 & \text{if } a_i \geq b, \\ 0 & \text{otherwise.} \end{cases}$$
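The recursion in step (2) can be sketched on ordinary integers as follows. This is only a model of the arithmetic: in the RNS the test against b would be carried out with core functions rather than a direct comparison, and the function name `quotient_bits` is hypothetical.

```python
def quotient_bits(a, b, k):
    """Bits e_1..e_k of a/b (0 <= a < b) from a_i = 2*(a_{i-1} - b*e_{i-1})."""
    bits = []
    a_i, e_prev = a, 0                  # initial values a_0 = a, e_0 = 0
    for _ in range(k):
        a_i = 2 * (a_i - b * e_prev)    # double the current remainder
        e_prev = 1 if a_i >= b else 0   # bit is set when b can be subtracted
        bits.append(e_prev)
    return bits


# Operands of Example 10 as integers, with k = floor(log2 693) = 9.
print(quotient_bits(87, 371, 9))        # [0, 0, 1, 1, 1, 1, 0, 0, 0]
```

The bits returned are the first k digits of the binary expansion of a/b, i.e., the digits of the integer part of a.2^k/b.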


(3) Evaluation of the sums used to evaluate the quotient q.

$$Q_1 = \sum_{i=1}^{k} e_i T_i.$$

Example 10. Consider the RNS with moduli {7,9,11}, and core function with weights {-1,-1,3}. Let
a = (3,6,10) with C(a) = 0, and b = (0,2,8) with C(b) = 5, represent two mantissas.
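The operands of this example can be checked with a short script. The sketch assumes the usual definition of the core function, C(n) = w_1.[n/m_1] + ... + w_r.[n/m_r], which is introduced outside this passage; the names `from_rns` and `core` are illustrative, and the core is computed here from the decoded integer rather than within the RNS.

```python
from math import prod

MODULI = (7, 9, 11)
WEIGHTS = (-1, -1, 3)
M = prod(MODULI)  # 693


def from_rns(residues):
    """Decode residues with the constructive Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)
    return total % M


def core(n):
    """Assumed definition: weighted sum of the integer quotients n // m_i."""
    return sum(w * (n // m) for w, m in zip(WEIGHTS, MODULI))


a = from_rns((3, 6, 10))
b = from_rns((0, 2, 8))
print(a, core(a))   # 87 0
print(b, core(b))   # 371 5
```

With that definition the decoded values a = 87 and b = 371 give the cores 0 and 5 quoted in the example, and a < b as the division algorithm requires.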

The residue representations of T_i, U_i, and their cores are shown in Table 2. The iterations of step (2)
are shown in Table 3. Each column corresponds to an iteration, starting with the column of initial
values. The row entries are evaluated as follows:

(i) a_i = 2.a_{i-1} if e_{i-1} = 0, or a_i = 2.(a_{i-1} - b) if e_{i-1} = 1,

(ii) C(a_i) evaluated using Theorem 2 for the core of 2.a_{i-1} or 2.(a_{i-1} - b), depending on the value of e_{i-1},

(iii) a_i - b computed in residue arithmetic,

(iv) C(a_i - b) computed using Theorem 2 for the core of a difference,

(v) C(a_i - b) computed using Theorem 13, the Core Chinese Remainder Theorem,

(vi) binary bit: e_i = 1 or 0 depending on whether the two core values given in (iv) and (v)
coincide or not.